Posted to dev@storm.apache.org by "P. Taylor Goetz" <pt...@gmail.com> on 2014/02/25 23:28:33 UTC

[DISCUSS] Pulling "Contrib" Modules into Apache

A while back I opened STORM-206 [1] to capture ideas for pulling in “contrib” modules to the Apache codebase.

In the past, we had the storm-contrib github project [2] which subsequently got broken up into individual projects hosted on the stormprocessor github group [3] and elsewhere.

The problem with this approach is that in certain cases it led to code rot (modules not being updated in step with Storm’s API), fragmentation (multiple similar modules with the same name), and confusion.

A good example of this is the storm-kafka module [4], since it is a widely used component. Because storm-contrib wasn’t being tagged in GitHub, a lot of users had trouble working out which versions of Storm it was compatible with. Some users built off specific commit hashes, some forked, and a few even pushed custom builds to repositories such as clojars. With Kafka 0.8 now available, there are two main storm-kafka projects: the original (compatible with Kafka 0.7) and an updated fork [5] (compatible with Kafka 0.8).

My intention is not to find fault in any way, but rather to point out the resulting pain, and work toward a better solution.

I think it would be beneficial to the Storm user community to have certain commonly used modules like storm-kafka brought into the Apache Storm project. Another benefit worth considering is the licensing/legal oversight that the ASF provides, which is important to many users.

If this is something we want to do, then the big question becomes what sort of governance process needs to be established to ensure that such things are properly maintained.

Some random thoughts, questions, etc. that jump to mind include:

- What to call these things: “contrib modules”, “connectors”, “integration modules”, etc.?
- Build integration: I imagine they would be a multi-module submodule of the main Maven build, probably turned off by default and enabled by a Maven profile.
- Governance: Have one or more committer volunteers responsible for maintenance, merging patches, etc.? A proposal process for pulling in new modules?
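
As a rough illustration of the build-integration idea, an opt-in profile in the parent pom.xml might look something like the fragment below. This is only a sketch: the profile id "contrib-modules" and the module path "contrib/storm-kafka" are hypothetical placeholders, not names from the actual Storm build.

```xml
<!-- Fragment of a parent pom.xml (sketch; id and path are hypothetical) -->
<profiles>
  <profile>
    <!-- Inactive unless requested explicitly on the command line -->
    <id>contrib-modules</id>
    <modules>
      <module>contrib/storm-kafka</module>
    </modules>
  </profile>
</profiles>
```

Under that assumption, a plain "mvn install" would skip the module, while "mvn -P contrib-modules install" would build it along with the rest of the project.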

I look forward to hearing others’ opinions.

- Taylor

[1] https://issues.apache.org/jira/browse/STORM-206
[2] https://github.com/nathanmarz/storm-contrib
[3] https://github.com/stormprocessor
[4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
[5] https://github.com/wurstmeister/storm-kafka-0.8-plus

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Mattijs,

That’s awesome that the NFI might be willing to contribute source code to Storm!

I hope I didn’t come across as suggesting that pulling in HolmesNL/kafka-spout was not an option. That’s up to the Storm community to decide. The point I was trying to make was that contributions from corporate entities require additional steps. That’s especially true when the contribution is an entire codebase (vs. small patches to an existing codebase). 

The fact that the project is already Apache v2-licensed facilitates part of the process.

- Taylor

On Mar 5, 2014, at 6:59 AM, Mattijs Ugen (DT) <ma...@holmes.nl> wrote:

>> storm-kafka will be somewhat complicated by the fact that
>> storm-kafka-0.8-plus was forked from the original source without commit
>> history. We’ll also have to figure out both whether and how to maintain
>> compatibility with two versions of kafka. I’ll propose starting with the
>> original storm-kafka, preserving commit history, and we can work from
>> there. As I mentioned previously, the author of storm-kafka-0.8-plus is
>> willing to help out.
>> 
>> While I agree that https://github.com/HolmesNL/kafka-spout is worthy of
>> consideration, it’s a little more complicated from an IP clearance
>> perspective. For that to be an option, I believe the Netherlands
>> Forensic Institute (the entity owning the IP) would have to donate it
>> to the ASF and go through a formal IP clearance process.
> The reason we put the source for our take on the concept on github was
> twofold:
> 
> - provide others with something that's worked well for us
> - allow us to benefit from efforts outside our own team in improving our
> component
> 
> If the ASF is interested in importing it, I'd personally say that would
> fit our goals for the project. My personal opinion aside, I'll go find
> someone who would be able to say something on the matter from the NFI's
> point of view. The license should be compatible with such a step, I
> think. I'll get back to you on the rest of this when I know more.
> 
> Kind regards,
> 
> Mattijs Ugen
> --
> Netherlands Forensic Institute
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "Mattijs Ugen (DT)" <ma...@holmes.nl>.

> storm-kafka will be somewhat complicated by the fact that
> storm-kafka-0.8-plus was forked from the original source without commit
> history. We’ll also have to figure out both whether and how to maintain
> compatibility with two versions of kafka. I’ll propose starting with the
> original storm-kafka, preserving commit history, and we can work from
> there. As I mentioned previously, the author of storm-kafka-0.8-plus is
> willing to help out.
>
> While I agree that https://github.com/HolmesNL/kafka-spout is worthy of
> consideration, it’s a little more complicated from an IP clearance
> perspective. For that to be an option, I believe the Netherlands
> Forensic Institute (the entity owning the IP) would have to donate it
> to the ASF and go through a formal IP clearance process.

The reason we put the source for our take on the concept on github was
twofold:

- provide others with something that's worked well for us
- allow us to benefit from efforts outside our own team in improving our
component

If the ASF is interested in importing it, I'd personally say that would
fit our goals for the project. My personal opinion aside, I'll go find
someone who would be able to say something on the matter from the NFI's
point of view. The license should be compatible with such a step, I
think. I'll get back to you on the rest of this when I know more.

Kind regards,

Mattijs Ugen
--
Netherlands Forensic Institute

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Thanks Michael and everyone else who participated in this discussion. It has been very constructive and raised some excellent points regarding not just the “contrib” module topic, but also how the project overall can be improved for both users and developers.

I don’t think it would be reasonable (or advisable) to try to tackle everything at once, so I think it best to work it out on a piece-by-piece, case-by-case basis. I have a git repo that has most of these modules incorporated (with full commit history) and integrated into the Maven build. It wouldn’t be hard to create separate pull requests for each, so they can be discussed and merged (or not) independently.

It might just be me, but in my experience project “contrib” directories can tend to be somewhat of a wild west in terms of how well they are maintained. Since one of the goals here is to make sure everything added becomes a first-class citizen within the project, I’m leaning toward using a different name. What do others think? Any thoughts on a different name?

There seems to be consensus that storm-starter and storm-kafka be brought in, so I will start there. I’ll open a pull request to bring storm-starter into an “examples” directory.

storm-kafka will be somewhat complicated by the fact that storm-kafka-0.8-plus was forked from the original source without commit history. We’ll also have to figure out both whether and how to maintain compatibility with two versions of Kafka. I’ll propose starting with the original storm-kafka, preserving commit history, and we can work from there. As I mentioned previously, the author of storm-kafka-0.8-plus is willing to help out.

While I agree that https://github.com/HolmesNL/kafka-spout is worthy of consideration, it’s a little more complicated from an IP clearance perspective. For that to be an option, I believe the Netherlands Forensic Institute (the entity owning the IP) would have to donate it to the ASF and go through a formal IP clearance process.

One final comment regarding Michael’s “Have Storm up and running faster than you can brew an espresso” note: Personally I think Vagrant [1] is awesome for this purpose and I use it heavily for testing Storm patches, releases, etc. A while back I made the project available on GitHub [2]; I’ve just been somewhat neglectful of pushing branches and enhancements. But it’s awesome to be able to go from zero to a running Storm cluster in a matter of minutes. I did something similar with Apache Whirr [3][4], but in my opinion one of the nice things about Vagrant is that it (and VirtualBox) is free, and if you forget and leave your clusters running, your credit card won't get dinged. (N.B.: I’m not suggesting any of the mentioned projects necessarily get pulled in, but that something along those lines could be really helpful for new users.)

- Taylor

[1] http://www.vagrantup.com
[2] https://github.com/ptgoetz/storm-vagrant
[3] https://github.com/ptgoetz/whirr-storm
[4] https://github.com/ptgoetz/whirr-kafka


On Mar 1, 2014, at 5:11 AM, Michael G. Noll <mi...@michael-noll.com> wrote:

> Thanks for starting this discussion, Taylor.
> 
> As a user of Storm (and a small-scale contributor to storm-starter) as
> well as a user of Kafka, here are my $.02.
> 
> [Storm and Kafka]
> First, I agree with Nathan that storm-kafka should be considered to be
> brought in.  While various "integrate Storm with X" options exist,
> basically everyone I have been talking to is using Kafka in
> combination with Storm.  I'm sure this is not a representative sample
> of Storm users, and of course one may or may not agree that Kafka is
> important enough of a technology in Storm's ecosystem.  Still, I do
> see the need to make sure Storm and Kafka do work together without
> having to go through forks of forks on GitHub and spending days to
> figure out how to get data from Kafka (0.8) into Storm.
>    Speaking of Kafka spout implementations, please don't forget
> https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> We've been quite happy with the former, so I'd suggest to at least
> consider both options here (maybe the two projects can even join forces?).
> 
> [Storm examples, storm-starter]
> Second, IMHO every open source project should have a "1-click starting
> experience" for new users.  That's very much related to the project
> principles of tools like LogStash [1] who say: "Community: If a newbie
> has a bad time, it's a bug."  For this reason I personally would like
> to see the equivalent of storm-starter being brought into the "core"
> Storm project -- think of an examples/ sub-module.  If the level of
> effort is deemed too high to e.g. maintain what's already in
> storm-starter, then (say) reduce the scope and remove some of the
> examples.  In any case I'd personally like to see bundled
> examples that are known to work with the latest version of Storm.
> storm-starter is often used to show new users how to get started with
> Storm (I used that approach in my Storm blog posts, for instance, and
> others like Mesosphere.io are even using storm-starter for their
> commercial offerings [2]).
> 
> [Have Storm up and running faster than you can brew an espresso]
> Third, for the same reason (get people up and running in a few
> minutes), I do like that other people in this thread have been
> bringing up projects like storm-deploy.  For the same reason I have
> open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> few days ago, and I'll soon open source another Vagrant/Puppet based
> tool that provides you with 1-click local and remote deployments of
> Storm and Kafka clusters.  That's way better IMHO than having to
> follow long articles or blog posts to deploy your first cluster.  And
> there are a number of other people that have been rolling their own
> variants.  Now don't get me wrong -- I don't mention this to pitch any
> of those tools.  My intention is to say that it would be greatly
> helpful to have /something/ like this for Storm, for the same reason
> that it's nice to have LocalCluster for unit testing.  I have been
> demo'ing both Storm and Kafka by launching clusters with a simple
> command line, which always gets people excited.  If they can then rely
> on existing examples (see above) to also /run/ an analysis on "their"
> cluster then they have a beautiful start.
>    Oh, and btw:  Apache Aurora (with Mesos) have such a Vagrant-based
> VM cluster setup, too [4] so that people can run the Aurora tutorial
> on their machines in a few minutes.
> 
> [Storm and YARN]
> Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> would be nice.  It ties into being able to run LocalCluster as well as
> to run Storm in local or remote VMs -- but now alongside your existing
> Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
> will surely be similarly attractive.
> 
> 
> On a related note bringing the Storm docs up to speed with the quality
> of the Storm code would also be great.  I have seen that since Storm
> moved to Incubator several new sections have been added such as the
> FAQ [5] (btw: nice!).
> 
> Similarly, there should be better examples and docs for users how to
> write unit tests for Storm.  Right now people seem to be cobbling
> together their test code by figuring out how the 1-year old code in
> [6] actually works, and copy-pasting other people's test code from GitHub.
> 
> --
> 
> As I said above, these are my personal $.02.  I admit that my comments
> go a bit beyond the original question of bringing in contrib modules
> -- I think implicitly the discussion about the contrib modules also
> means "what do you need to provide a better and more well-rounded
> experience", i.e. the question whether to have batteries included or
> not. (As you may suspect I'm leaning towards including at least the
> most important batteries, though what's really "important" at the
> project level is of course up for debate.)
> 
> On my side I'd be happy to help with those areas where I am able to
> contribute, whether that's code and examples (like storm-starter) or
> tutorials/docs (I already wrote e.g. [7] and [8]).
> 
> Again, thanks Taylor for starting this discussion.  No matter the
> actual outcome I'm sure the state of the project will be improved.
> 
> Best,
> Michael
> 
> 
> 
> [1] https://github.com/elasticsearch/logstash
> [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> [3] https://github.com/miguno/puppet-storm
> [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> [5] http://storm.incubator.apache.org/documentation/FAQ.html
> [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> 
> 
> 
> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> Thanks for the feedback Bobby.
>> 
>> To clarify, I’m mainly talking about spout/bolt/trident state 
>> implementations that integrate storm with *Technology X*, where 
>> *Technology X* is not a fundamental part of storm.
>> 
>> Examples would be technologies that are part of or related to the 
>> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>> Kafka, HDFS, HBase, Cassandra, etc.
>> 
>> The idea behind having one or more Storm committers act as a
>> “sponsor” is to make sure new additions are done carefully and with
>> good reason. To add a new module, it would require committer/PPMC
>> consensus, and assignment of one or more sponsors. Part of a
>> sponsor’s job would be to ensure that a module is maintained, which
>> would require enough familiarity with the code to support it long
>> term. If a new module was proposed, but no committers were willing
>> to act as a sponsor, it would not be added.
>> 
>> It would be the Committers’/PPMC’s responsibility to make sure things
>> don’t get out of hand, and to do something about it if they do.
>> 
>> Here’s an old Hadoop JIRA thread [1] discussing the addition of
>> Hive as a contrib module, similar to what happened with HBase as
>> Bobby pointed out. Some interesting points are brought up. The
>> difference here is that both HBase and Hive were pretty big
>> codebases relative to Hadoop. With spout/bolt/state implementations
>> I doubt we’d see anything along that scale.
>> 
>> - Taylor
>> 
>> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> 
>> 
>> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>> 
>>> I can see a lot of value in having a distribution of storm that
>>> comes with batteries included, everything is tested together and
>>> you know it works.  But I don’t see much long term developer
>>> benefit in building them all together.  If there is strong
>>> coupling between storm and these external projects so that they
>>> break when storm changes then we need to understand the coupling
>>> and decide if we want to reduce that coupling by stabilizing
>>> APIs, improving version numbering and release process, etc.; or
>>> if the functionality is something that should be offered as a
>>> base service in storm.
>>> 
>>> I can see politically the value of giving these other projects a
>>> home in Apache, and making them sub-projects is the simplest
>>> route to that. I’d love to have storm on yarn inside Apache.  I
>>> just don’t want to go overboard with it.  There was a time when
>>> HBase was a “contrib” module under Hadoop along with a lot of
>>> other things, and the Apache board came and told Hadoop to break
>>> it up.
>>> 
>>> Bringing storm-kafka into storm does not sound like it will solve
>>> much from a developer’s perspective, because there is at least as
>>> much coupling with kafka as there is with storm.  I can see how
>>> it is a huge amount of overhead and pain to set up a new project
>>> just for a few hundred lines of code, as such I am in favor of
>>> pulling in closely related projects, especially those that are
>>> spouts and state implementations. I just want to be sure that we
>>> do it carefully, with a good reason, and with enough people who
>>> are familiar with the code to support it long term.
>>> 
>>> If it starts to look like we are pulling in too many projects
>>> perhaps we should look at something more like the bigtop project 
>>> https://bigtop.apache.org/ which produces a tested distribution
>>> of Hadoop with many different sub-projects included in it.
>>> 
>>> I am also a bit concerned about these sub-projects becoming
>>> second class citizens, where we break something, but because the
>>> build is off by default we don’t know it.  I would prefer that
>>> they are built and tested by default.  If the build and test time
>>> starts to take too long, to me that means we need to start
>>> wondering if we have too many contrib modules.
>>> 
>>> —Bobby
>>> 
>>> From: Brian Enochson <brian.enochson@gmail.com>
>>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>> Date: Tuesday, February 25, 2014 at 9:50 PM
>>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>>> 
>>> hi, I am in agreement with Taylor and believe I understand his
>>> intent. An incredible tool/framework/application like Storm is
>>> only enhanced and gains value from the number of well maintained
>>> and vetted modules that can be used for integration and adding
>>> further functionality. I am relatively new to the Storm community
>>> but have spent quite some time reviewing contributing modules out
>>> there, reviewing various duplicates and running into some version
>>> incompatibilities. I understand the need to keep Storm itself
>>> pure, but do think there needs to be some structure and
>>> governance added to the contributing modules. Look at the benefit
>>> a tool like npm brings to the node community. I like the idea of
>>> sponsorship, vetting and a community vote.  I, as I'm sure many would
>>> be, am willing to offer support and time to working through how
>>> to set this up and helping with the implementation if it is
>>> decided to pursue some solution. I hope these views are taken in
>>> the spirit they are made, to make this incredible system even
>>> better along with the surrounding eco-system.
>>> 
>>> Thanks, Brian
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>> 
>>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
>>> suggesting that whatever a “contrib” project/module/subproject
>>> might become, be a clearinghouse for anything Storm-related.
>>> 
>>> I see it as something that is well-vetted by the Storm
>>> community, subject to PPMC review, vote, etc. Entry would require
>>> community review, PPMC review, and in some cases ASF IP
>>> clearance/legal review. Anything added would require some level
>>> of commitment from the PPMC/committers to provide some level of
>>> support.
>>> 
>>> In other words, nothing “willy-nilly”.
>>> 
>>> One option could be that any module added require (X > 0)  number
>>> of committers to volunteer as “sponsor”s for the module, and
>>> commit to maintaining it.
>>> 
>>> That being said, I don’t see storm-kafka being any different
>>> from anything else that provides integration points for Storm.
>>> 
>>> -Taylor
>>> 
>>> 
>>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>>> 
>>> I'm only +1 for pulling in storm-kafka and updating it. Other
>>> projects put these contrib modules in a "contrib" folder and keep
>>> them managed as completely separate codebases. As it's not
>>> actually a "module" necessary for Storm, there's an argument
>>> there for doing it that way rather than via the multi-module
>>> route.
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
>>> 
>>> Hi Taylor,
>>> 
>>> I'm +1 for pulling these external libraries into the Apache codebase.
>>> This will certainly benefit the Storm community. I would also like to
>>> contribute to this process.
>>> 
>>> Thanks Milinda
>>> 
>>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>>> A while back I opened STORM-206 [1] to capture ideas for
>>>> pulling in "contrib" modules to the Apache codebase.
>>>> 
>>>> In the past, we had the storm-contrib github project [2] which 
>>>> subsequently got broken up into individual projects hosted on
>>>> the stormprocessor github group [3] and elsewhere.
>>>> 
>>>> The problem with this approach is that in certain cases it led
>>>> to code rot (modules not being updated in step with Storm's
>>>> API), fragmentation (multiple similar modules with the same
>>>> name), and confusion.
>>>> 
>>>> A good example of this is the storm-kafka module [4], since it
>>>> is a widely used component. Because storm-contrib wasn't being
>>>> tagged in github, a lot of users had trouble reconciling with
>>>> which versions of storm it was compatible. Some users built off
>>>> specific commit hashes, some forked, and a few even pushed
>>>> custom builds to repositories such as clojars. With kafka 0.8
>>>> now available, there are two main storm-kafka projects, the
>>>> original (compatible with kafka 0.7) and an updated fork [5]
>>>> (compatible with kafka 0.8).
>>>> 
>>>> My intention is not to find fault in any way, but rather to
>>>> point out the resulting pain, and work toward a better
>>>> solution.
>>>> 
>>>> I think it would be beneficial to the Storm user community to
>>>> have certain commonly used modules like storm-kafka brought
>>>> into the Apache Storm project. Another benefit worth
>>>> considering is the licensing/legal oversight that the ASF
>>>> provides, which is important to many users.
>>>> 
>>>> If this is something we want to do, then the big question
>>>> becomes what sort of governance process needs to be established to
>>>> ensure that such things are properly maintained.
>>>> 
>>>> Some random thoughts, questions, etc. that jump to mind
>>>> include:
>>>> 
>>>> - What to call these things: "contrib modules", "connectors",
>>>>   "integration modules", etc.?
>>>> - Build integration: I imagine they would be a multi-module submodule
>>>>   of the main maven build. Probably turned off by default and enabled
>>>>   by a maven profile.
>>>> - Governance: Have one or more committer volunteers responsible
>>>>   for maintenance, merging patches, etc.? Proposal process for
>>>>   pulling new modules?
>>>> 
>>>> 
>>>> I look forward to hearing others' opinions.
>>>> 
>>>> - Taylor
>>>> 
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>>> [2] https://github.com/nathanmarz/storm-contrib
>>>> [3] https://github.com/stormprocessor
>>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Nathan Marz <na...@nathanmarz.com>.

+1 on having both an examples/ and external/ directory. external/ is a
better name than other/


On Mon, Mar 17, 2014 at 9:50 AM, P. Taylor Goetz <pt...@gmail.com> wrote:

> I like the idea of using the name "external" as it more effectively
> communicates that the contents are not part of the core project.
>
> I also like having examples as a top-level directory to make it easier for
> users to find.
>
> - Taylor
>
> On Mar 16, 2014, at 7:02 AM, Michael G. Noll <michael+storm@michael-noll.com> wrote:
>
> > One further piece of food for thought:
> >
> > The Spark project has the following directory layout [1] in this regard:
> >
> > examples/
> > external/
> >    |
> >    +-- flume
> >    +-- kafka
> >    +-- mqtt
> >    +-- twitter
> >    +-- zeromq
> >
> > Note how 'kafka' is a connector to another OSS tool -- like
> > storm-kafka's spout -- whereas 'twitter' is their implementation of
> > pulling data from Twitter's (streaming) API.  Of course, the 'kafka'
> > code similarly connects to an API, but there is still a small
> > difference between 'twitter' (hosted API run by Twitter) and 'kafka'
> > (your own Kafka infrastructure).  Both sub-projects fit nicely under
> > 'external' though IMHO.
> >
> > As I said -- just further brainstorming.
> >
> > Michael
> >
> >
> >
> > [1] https://github.com/apache/incubator-spark
> >
> >
> >
> > On 03/14/2014 08:06 PM, Nathan Marz wrote:
> >> How about we make a folder under root called "other" in which everything
> >> non-core can go. We can do further subfolders if we want called "examples"
> >> and "connectors" - I don't care either way. I think this will first of all
> >> make it clear these things are not part of the core project, and it will
> >> also prevent the root of the source from getting cluttered with too much
> >> stuff.
> >>
> >>
> >> On Thu, Mar 13, 2014 at 4:16 PM, Ted Dunning <te...@gmail.com> wrote:
> >>
> >>> Taylor,
> >>>
> >>> You guys have been doing a generally excellent job.  I was just chiming in
> >>> on the chance that there was doubt.
> >>>
> >>>
> >>> On Thu, Mar 13, 2014 at 4:09 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> >>>
> >>>> Thanks Ted,
> >>>>
> >>>> We're being very careful when pulling in additional code by taking steps
> >>>> to preserve commit history (chain of evidence), and when necessary,
> >>>> initiate the IP clearance process (haven't had to yet).
> >>>>
> >>>
> >>> Cool.
> >>>
> >>>
> >>>> The latter is kind of a gray area as far as I can tell from questions I've
> >>>> asked on general@. It seems to be a judgment call based on the size of
> >>>> the contribution.
> >>>>
> >>>
> >>> It is exactly that.
> >>>
> >>>
> >>>>
> >>>> If there's anything else we can do to make sure we get these things right,
> >>>> or do a better job, please let us know.
> >>>>
> >>>
> >>> So far, things are going swimmingly, due in no small part to your efforts.
> >>>
> >>>
> >>>
> >>>>
> >>>> -Taylor
> >>>>
> >>>>> On Mar 13, 2014, at 4:03 PM, Ted Dunning <te...@gmail.com> wrote:
> >>>>>
> >>>>> Having a committer sign off on each addition has a very large role at
> >>>>> Apache.  One of the key aspects of Apache software releases is that all of
> >>>>> the code is traceable back to the original contributor and there is a
> >>>>> logical chain that allows Apache to stand behind the licensing of the code.
> >>>>>
> >>>>> This licensing and chain of evidence is a big part of what makes open
> >>>>> source palatable to risk averse businesses.  It is really important to
> >>>>> maintain.
> >>>>>
> >>>>> Storm has a very good record of doing this before being part of Apache
> >>>>> which makes integration into Apache processes easier, but it is important
> >>>>> to hang on to that careful approach.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>> On Thu, Mar 13, 2014 at 12:58 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> >>>>>>
> >>>>>> Exactly.
> >>>>>>
> >>>>>> That's why I proposed that anything that's brought in require at least one
> >>>>>> committer to "sponsor" it:
> >>>>
> >>>>
> >>>
> >>
> >>
> >>
> >
>
>


-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

I like the idea of using the name “external” as it more effectively communicates that the contents are not part of the core project.

I also like having examples as a top-level directory to make it easier for users to find.

- Taylor

On Mar 16, 2014, at 7:02 AM, Michael G. Noll <mi...@michael-noll.com> wrote:

> One further piece of food for thought:
> 
> The Spark project has the following directory layout [1] in this regard:
> 
> examples/
> external/
>    |
>    +-- flume
>    +-- kafka
>    +-- mqtt
>    +-- twitter
>    +-- zeromq
> 
> Note how 'kafka' is a connector to another OSS tool -- like
> storm-kafka's spout -- whereas 'twitter' is their implementation of
> pulling data from Twitter's (streaming) API.  Of course, the 'kafka'
> code similarly connects to an API, but there is still a small
> difference between 'twitter' (hosted API run by Twitter) and 'kafka'
> (your own Kafka infrastructure).  Both sub-projects fit nicely under
> 'external' though IMHO.
> 
> As I said -- just further brainstorming.
> 
> Michael
> 
> 
> 
> [1] https://github.com/apache/incubator-spark
> 
> 
> 

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "Michael G. Noll" <mi...@michael-noll.com>.

One further piece of food for thought:

The Spark project has the following directory layout [1] in this regard:

examples/
external/
    |
    +-- flume
    +-- kafka
    +-- mqtt
    +-- twitter
    +-- zeromq

Note how 'kafka' is a connector to another OSS tool -- like
storm-kafka's spout -- whereas 'twitter' is their implementation of
pulling data from Twitter's (streaming) API.  Of course, the 'kafka'
code similarly connects to an API, but there is still a small
difference between 'twitter' (hosted API run by Twitter) and 'kafka'
(your own Kafka infrastructure).  Both sub-projects fit nicely under
'external' though IMHO.

As I said -- just further brainstorming.

Michael



[1] https://github.com/apache/incubator-spark



On 03/14/2014 08:06 PM, Nathan Marz wrote:
> How about we make a folder under root called "other" in which everything
> non-core can go. We can do further subfolders if we want called "examples"
> and "connectors" - I don't care either way. I think this will first of all
> make it clear these things are not part of the core project, and it will
> also prevent the root of the source from getting cluttered with too much
> stuff.
> 
> 

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Sean Zhong <cl...@gmail.com>.

+1 on Nathan's suggestion on name.




On Sat, Mar 15, 2014 at 3:06 AM, Nathan Marz <na...@nathanmarz.com> wrote:

> How about we make a folder under root called "other" in which everything
> non-core can go. We can do further subfolders if we want called "examples"
> and "connectors" - I don't care either way. I think this will first of all
> make it clear these things are not part of the core project, and it will
> also prevent the root of the source from getting cluttered with too much
> stuff.
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Nathan Marz <na...@nathanmarz.com>.

How about we make a folder under root called "other" in which everything
non-core can go. We can do further subfolders if we want called "examples"
and "connectors" - I don't care either way. I think this will first of all
make it clear these things are not part of the core project, and it will
also prevent the root of the source from getting cluttered with too much
stuff.


On Thu, Mar 13, 2014 at 4:16 PM, Ted Dunning <te...@gmail.com> wrote:

> Taylor,
>
> You guys have been doing a generally excellent job.  I was just chiming in
> on the chance that there was doubt.
>
>
> On Thu, Mar 13, 2014 at 4:09 PM, P. Taylor Goetz <pt...@gmail.com>
> wrote:
>
> > Thanks Ted,
> >
> > We're being very careful when pulling in additional code by taking steps
> > to preserve commit history (chain of evidence), and when necessary,
> > initiate the IP clearance process (haven't had to yet).
> >
>
> Cool.
>
>
> > The latter is kind of a gray area as far as I can tell from questions I've
> > asked on general@. It seems to be a judgment call based on the size of
> > the contribution.
> >
>
> It is exactly that.
>
>
> >
> > If there's anything else we can do to make sure we get these things right,
> > or do a better job, please let us know.
> >
>
> So far, things are going swimmingly, due in no small part to your efforts.
>
>
>
> >
> > -Taylor
> >
>



-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Ted Dunning <te...@gmail.com>.

Taylor,

You guys have been doing a generally excellent job.  I was just chiming in
on the chance that there was doubt.


On Thu, Mar 13, 2014 at 4:09 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> Thanks Ted,
>
> We're being very careful when pulling in additional code by taking steps
> to preserve commit history (chain of evidence), and when necessary,
> initiate the IP clearance process (haven't had to yet).
>

Cool.


> The latter is kind of a gray area as far as I can tell from questions I've
> asked on general@. It seems to be a judgment call based on the size of
> the contribution.
>

It is exactly that.


>
> If there's anything else we can do to make sure we get these things right,
> or do a better job, please let us know.
>

So far, things are going swimmingly, due in no small part to your efforts.



>
> -Taylor
>
>
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Thanks Ted,

We're being very careful when pulling in additional code by taking steps to preserve commit history (chain of evidence), and when necessary, initiate the IP clearance process (haven't had to yet).

The latter is kind of a gray area as far as I can tell from questions I've asked on general@. It seems to be a judgment call based on the size of the contribution.

If there's anything else we can do to make sure we get these things right, or do a better job, please let us know.

-Taylor

> On Mar 13, 2014, at 4:03 PM, Ted Dunning <te...@gmail.com> wrote:
> 
> Having a committer sign off on each addition has a very large role at
> Apache.  One of the key aspects of Apache software releases is that all of
> the code is traceable back to the original contributor and there is a
> logical chain that allows Apache to stand behind the licensing of the code.
> 
> This licensing and chain of evidence is a big part of what makes open
> source palatable to risk averse businesses.  It is really important to
> maintain.
> 
> Storm has a very good record of doing this before being part of Apache
> which makes integration into Apache processes easier, but it is important
> to hang on to that careful approach.
> 
> 
> 
> 
>> On Thu, Mar 13, 2014 at 12:58 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>> 
>> Exactly.
>> 
>> That’s why I proposed that anything that’s brought in require at least one
>> committer to “sponsor” it:
>> 
>>>>>>> The idea behind having one or more Storm committers act as a
>>>>>>> "sponsor" is to make sure new additions are done carefully and with
>>>>>>> good reason. To add a new module, it would require committer/PPMC
>>>>>>> consensus, and assignment of one or more sponsors. Part of a
>>>>>>> sponsor's job would be to ensure that a module is maintained, which
>>>>>>> would require enough familiarity with the code to support it long
>>>>>>> term. If a new module was proposed, but no committers were willing
>>>>>>> to act as a sponsor, it would not be added.
>> 
>> 
>> Perhaps a README in the directory stating such a policy would make it
>> clear that what’s in there is officially endorsed and maintained by the
>> PPMC.
>> 
>> So we would have a process in place to prevent the “anything and
>> everything” situation. We would only add things related to tech that is
>> widely used in conjunction with Storm (Kafka, Cassandra, HDFS, etc.).
>> 
>> 
>> I’d like to start the work for pulling in storm-kafka, I just need a
>> directory to put it in for now. Changing the name later is just one `mv`
>> command.
>> 
>> - Taylor
>> 
>> 
>>> On Mar 13, 2014, at 3:08 PM, Nathan Marz <na...@nathanmarz.com> wrote:
>>> 
>>> We also don't want to create the impression that anything and everything
>>> belongs in the Storm project itself. storm-kafka is special because Kafka
>>> works so well with Storm and is so widely used. But if there was a folder
>>> called "connectors" or "adapters", people may think we're willing to pull
>>> in anything and everything.
>>> 
>>> +1 for putting storm-starter in an examples/ directory.
>>> 
>>> 
>>> On Thu, Mar 13, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>>> 
>>>> To clarify somewhat, the pull request for pulling in storm-starter [1]
>>>> puts it in an "examples" directory. And there are suggestions to pull
>>>> James' scheduler and testing examples in there as well. So there is a
>>>> distinction between examples and other things like storm-kafka.
>>>> 
>>>> What I'm proposing is a different yet-to-be-named directory that would be
>>>> home to things that integrate storm with other technologies.
>>>> 
>>>> In the storm-contrib README [2] the term used is "modules". On the Storm
>>>> website, we also use the term "adapter" [3].
>>>> 
>>>> - Taylor
>>>> 
>>>> [1] https://github.com/apache/incubator-storm/pull/44
>>>> [2] https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about
>>>> [3] http://storm.incubator.apache.org/documentation/Spout-implementations.html
>>>> 
>>>> 
>>>> On Mar 13, 2014, at 1:24 AM, David Miller <david.miller@m-square.com.au> wrote:
>>>> 
>>>> 
>>>> What about both?
>>>> connectors for spouts/bolts/states that connect to other tech: storm-kafka,
>>>> storm-cassandra, etc.
>>>> extras for other things like storm-starter, storm-deploy, storm-puppet
>>>> 
>>>> 
>>>> 
>>>> On 13 Mar 2014, at 3:57 pm, Nathan Marz <na...@nathanmarz.com> wrote:
>>>> 
>>>> I don't like either name tbh. Storm itself is already broken into modules
>>>> (storm-core, storm-netty, etc) and things like storm-starter and
>>>> storm-kafka are something different. I don't like "connectors" because
>>>> something like storm-starter is not a connector. Maybe we call them
>>>> "extras"?
>>>> 
>>>> I would say just to support 0.8.x of Kafka.
>>>> 
>>>> 
>>>> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>>> 
>>>>> Incorporation of storm starter is underway.
>>>>> 
>>>>> I'd like to turn the attention to kafka, with the goal being to pull in
>>>>> kafka support that is maintained and will be known to be compatible with
>>>>> the current version of storm and specific version(s) of kafka.
>>>>> 
>>>>> I have the following questions for the community:
>>>>> 
>>>>> 1. What do we want to call additions like this? I'm leaning toward
>>>>> "modules" or "connectors".
>>>>> 
>>>>> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
>>>>> 0.8.x? From a release management perspective, the latter is preferable
>>>>> because the 0.7.x line artifacts are not in maven central. This makes
>>>>> building a real pain, and maintaining support for two versions won't be
>>>>> fun. Also, most of the people I have worked with are looking at 0.8.x for a
>>>>> variety of reasons, but I'm open to either way.
>>>>> 
>>>>> - Taylor
>>>>> 
>>>>> 
>>>>>> On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <michael+storm@michael-noll.com> wrote:
>>>>>> 
>>>>>> Thanks for starting this discussion, Taylor.
>>>>>> 
>>>>>> As a user of Storm (and a small-scale contributor to storm-starter) as
>>>>>> well as a user of Kafka, here are my $.02.
>>>>>> 
>>>>>> [Storm and Kafka]
>>>>>> First, I agree with Nathan that storm-kafka should be considered to be
>>>>>> brought in.  While various "integrate Storm with X" options exist,
>>>>>> basically everyone I have been talking to is using Kafka in
>>>>>> combination with Storm.  I'm sure this is not a representative sample
>>>>>> of Storm users, and of course one may or may not agree that Kafka is
>>>>>> important enough of a technology in Storm's ecosystem.  Still, I do
>>>>>> see the need to make sure Storm and Kafka do work together without
>>>>>> having to go through forks of forks on GitHub and spending days to
>>>>>> figure out how to get data from Kafka (0.8) into Storm.
>>>>>>  Speaking of Kafka spout implementations, please don't forget
>>>>>> https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
>>>>>> We've been quite happy with the former, so I'd suggest at least
>>>>>> considering both options here (maybe the two projects can even join
>>>>>> forces?).
>>>>>> 
>>>>>> [Storm examples, storm-starter]
>>>>>> Second, IMHO every open source project should have a "1-click starting
>>>>>> experience" for new users.  That's very much related to the project
>>>>>> principles of tools like LogStash [1] who say: "Community: If a newbie
>>>>>> has a bad time, it's a bug."  For this reason I personally would like
>>>>>> to see the equivalent of storm-starter being brought into the "core"
>>>>>> Storm project -- think of an examples/ sub-module.  If the level of
>>>>>> effort is deemed too high to e.g. maintain what's already in
>>>>>> storm-starter, then (say) reduce the scope and remove some of the
>>>>>> examples.  In any case I personally would like to see bundled
>>>>>> examples that are known to work with the latest version of Storm.
>>>>>> storm-starter is often used to show new users how to get started with
>>>>>> Storm (I used that approach in my Storm blog posts, for instance, and
>>>>>> others like Mesosphere.io are even using storm-starter for their
>>>>>> commercial offerings [2]).
>>>>>> 
>>>>>> [Have Storm up and running faster than you can brew an espresso]
>>>>>> Third, for the same reason (get people up and running in a few
>>>>>> minutes), I do like that other people in this thread have been
>>>>>> bringing up projects like storm-deploy.  For the same reason I have
>>>>>> open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
>>>>>> few days ago, and I'll soon open source another Vagrant/Puppet based
>>>>>> tool that provides you with 1-click local and remote deployments of
>>>>>> Storm and Kafka clusters.  That's way better IMHO than having to
>>>>>> follow long articles or blog posts to deploy your first cluster.  And
>>>>>> there are a number of other people that have been rolling their own
>>>>>> variants.  Now don't get me wrong -- I don't mention this to pitch any
>>>>>> of those tools.  My intention is to say that it would be greatly
>>>>>> helpful to have /something/ like this for Storm, for the same reason
>>>>>> that it's nice to have LocalCluster for unit testing.  I have been
>>>>>> demo'ing both Storm and Kafka by launching clusters with a simple
>>>>>> command line, which always gets people excited.  If they can then rely
>>>>>> on existing examples (see above) to also /run/ an analysis on "their"
>>>>>> cluster then they have a beautiful start.
>>>>>>  Oh, and btw:  Apache Aurora (with Mesos) have such a Vagrant-based
>>>>>> VM cluster setup, too [4] so that people can run the Aurora tutorial
>>>>>> on their machines in a few minutes.
>>>>>> 
>>>>>> [Storm and YARN]
>>>>>> Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
>>>>>> would be nice.  It ties into being able to run LocalCluster as well as
>>>>>> to run Storm in local or remote VMs -- but now alongside your existing
>>>>>> Hadoop/YARN infrastructure.  For those preferring Mesos Storm-on-Mesos
>>>>>> will surely be similarly attractive.
>>>>>> 
>>>>>> 
>>>>>> On a related note bringing the Storm docs up to speed with the quality
>>>>>> of the Storm code would also be great.  I have seen that since Storm
>>>>>> moved to Incubator several new sections have been added such as the
>>>>>> FAQ [5] (btw: nice!).
>>>>>> 
>>>>>> Similarly, there should be better examples and docs for users on how to
>>>>>> write unit tests for Storm.  Right now people seem to be cobbling
>>>>>> together their test code by figuring out how the 1-year old code in
>>>>>> [6] actually works, and copy-pasting other people's test code from
>>>>>> GitHub.
>>>>>> 
>>>>>> --
>>>>>> 
>>>>>> As I said above, these are my personal $.02.  I admit that my comments
>>>>>> go a bit beyond the original question of bringing in contrib modules
>>>>>> -- I think implicitly the discussion about the contrib modules also
>>>>>> means "what do you need to provide a better and more well-rounded
>>>>>> experience", i.e. the question whether to have batteries included or
>>>>>> not. (As you may suspect I'm leaning towards including at least the
>>>>>> most important batteries, though what's really "important" at the
>>>>>> project level is of course up for debate.)
>>>>>> 
>>>>>> On my side I'd be happy to help with those areas where I am able to
>>>>>> contribute, whether that's code and examples (like storm-starter) or
>>>>>> tutorials/docs (I already wrote e.g. [7] and [8]).
>>>>>> 
>>>>>> Again, thanks Taylor for starting this discussion.  No matter the
>>>>>> actual outcome I'm sure the state of the project will be improved.
>>>>>> 
>>>>>> Best,
>>>>>> Michael
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> [1] https://github.com/elasticsearch/logstash
>>>>>> [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
>>>>>> [3] https://github.com/miguno/puppet-storm
>>>>>> [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
>>>>>> [5] http://storm.incubator.apache.org/documentation/FAQ.html
>>>>>> [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
>>>>>> [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
>>>>>> [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>>> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>>>>>>> Thanks for the feedback Bobby.
>>>>>>> 
>>>>>>> To clarify, I'm mainly talking about spout/bolt/trident state
>>>>>>> implementations that integrate storm with *Technology X*, where
>>>>>>> *Technology X* is not a fundamental part of storm.
>>>>>>> 
>>>>>>> Examples would be technologies that are part of or related to the
>>>>>>> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>>>>>>> Kafka, HDFS, HBase, Cassandra, etc.
>>>>>>> 
>>>>>>> The idea behind having one or more Storm committers act as a
>>>>>>> "sponsor" is to make sure new additions are done carefully and with
>>>>>>> good reason. To add a new module, it would require committer/PPMC
>>>>>>> consensus, and assignment of one or more sponsors. Part of a
>>>>>>> sponsor's job would be to ensure that a module is maintained, which
>>>>>>> would require enough familiarity with the code to support it long
>>>>>>> term. If a new module was proposed, but no committers were willing
>>>>>>> to act as a sponsor, it would not be added.
>>>>>>> 
>>>>>>> It would be the Committers'/PPMC's responsibility to make sure things
>>>>>>> don't get out of hand, and to do something about it if they do.
>>>>>>> 
>>>>>>> Here's an old Hadoop JIRA thread [1] discussing the addition of
>>>>>>> Hive as a contrib module, similar to what happened with HBase as
>>>>>>> Bobby pointed out. Some interesting points are brought up. The
>>>>>>> difference here is that both HBase and Hive were pretty big
>>>>>>> codebases relative to Hadoop. With spout/bolt/state implementations
>>>>>>> I doubt we'd see anything along that scale.
>>>>>>> 
>>>>>>> - Taylor
>>>>>>> 
>>>>>>> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>>>>>>> 
>>>>>>> 
>>>>>>> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>>>>>>> 
>>>>>>>> I can see a lot of value in having a distribution of storm that
>>>>>>>> comes with batteries included, everything is tested together and
>>>>>>>> you know it works.  But I don't see much long term developer
>>>>>>>> benefit in building them all together.  If there is strong
>>>>>>>> coupling between storm and these external projects so that they
>>>>>>>> break when storm changes then we need to understand the coupling
>>>>>>>> and decide if we want to reduce that coupling by stabilizing
>>>>>>>> APIs, improving version numbering and release process, etc.; or
>>>>>>>> if the functionality is something that should be offered as a
>>>>>>>> base service in storm.
>>>>>>>> 
>>>>>>>> I can see politically the value of giving these other projects a
>>>>>>>> home in Apache, and making them sub-projects is the simplest
>>>>>>>> route to that. I'd love to have storm on yarn inside Apache.  I
>>>>>>>> just don't want to go overboard with it.  There was a time when
>>>>>>>> HBase was a "contrib" module under Hadoop along with a lot of
>>>>>>>> other things, and the Apache board came and told Hadoop to break
>>>>>>>> it up.
>>>>>>>> 
>>>>>>>> Bringing storm-kafka into storm does not sound like it will solve
>>>>>>>> much from a developer's perspective, because there is at least as
>>>>>>>> much coupling with kafka as there is with storm.  I can see how
>>>>>>>> it is a huge amount of overhead and pain to set up a new project
>>>>>>>> just for a few hundred lines of code, as such I am in favor of
>>>>>>>> pulling in closely related projects, especially those that are
>>>>>>>> spouts and state implementations. I just want to be sure that we
>>>>>>>> do it carefully, with a good reason, and with enough people who
>>>>>>>> are familiar with the code to support it long term.
>>>>>>>> 
>>>>>>>> If it starts to look like we are pulling in too many projects
>>>>>>>> perhaps we should look at something more like the bigtop project
>>>>>>>> https://bigtop.apache.org/ which produces a tested distribution
>>>>>>>> of Hadoop with many different sub-projects included in it.
>>>>>>>> 
>>>>>>>> I am also a bit concerned about these sub-projects becoming
>>>>>>>> second class citizens, where we break something, but because the
>>>>>>>> build is off by default we don't know it.  I would prefer that
>>>>>>>> they are built and tested by default.  If the build and test time
>>>>>>>> starts to take too long, to me that means we need to start
>>>>>>>> wondering if we have too many contrib modules.
>>>>>>>> 
>>>>>>>> --Bobby
>>>>>>>> 
>>>>>>>> From: Brian Enochson <brian.enochson@gmail.com>
>>>>>>>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>>>>>>> Date: Tuesday, February 25, 2014 at 9:50 PM
>>>>>>>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>>>>>>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>>>>>>>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>>>>>>>> 
>>>>>>>> Hi, I am in agreement with Taylor and believe I understand his
>>>>>>>> intent. An incredible tool/framework/application like Storm is
>>>>>>>> only enhanced and gains value from the number of well maintained
>>>>>>>> and vetted modules that can be used for integration and adding
>>>>>>>> further functionality. I am relatively new to the Storm community
>>>>>>>> but have spent quite some time reviewing contributing modules out
>>>>>>>> there, reviewing various duplicates and running into some version
>>>>>>>> incompatibilities. I understand the need to keep Storm itself
>>>>>>>> pure, but do think there needs to be some structure and
>>>>>>>> governance added to the contributing modules. Look at the benefit
>>>>>>>> a tool like npm brings to the node community. I like the idea of
>>>>>>>> sponsorship, vetting and a community vote.  I, as I'm sure many would
>>>>>>>> be, am willing to offer support and time to working through how
>>>>>>>> to set this up and helping with the implementation if it is
>>>>>>>> decided to pursue some solution. I hope these views are taken in
>>>>>>>> the spirit they are made, to make this incredible system even
>>>>>>>> better along with the surrounding eco-system.
>>>>>>>> 
>>>>>>>> Thanks, Brian
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Just to be clear (and play a little Devil's advocate :) ), I'm not
>>>>>>>> suggesting that whatever a "contrib" project/module/subproject
>>>>>>>> might become, be a clearinghouse for anything Storm-related.
>>>>>>>> 
>>>>>>>> I see it as something that is well-vetted by the Storm
>>>>>>>> community, subject to PPMC review, vote, etc. Entry would require
>>>>>>>> community review, PPMC review, and in some cases ASF IP
>>>>>>>> clearance/legal review. Anything added would require some level
>>>>>>>> of commitment from the PPMC/committers to provide some level of
>>>>>>>> support.
>>>>>>>> 
>>>>>>>> In other words, nothing "willy-nilly".
>>>>>>>> 
>>>>>>>> One option could be that any module added require (X > 0)  number
>>>>>>>> of committers to volunteer as "sponsor"s for the module, and
>>>>>>>> commit to maintaining it.
>>>>>>>> 
>>>>>>>> That being said, I don't see storm-kafka being any different
>>>>>>>> from anything else that provides integration points for Storm.
>>>>>>>> 
>>>>>>>> -Taylor
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>>>>>>>> 
>>>>>>>> I'm only +1 for pulling in storm-kafka and updating it. Other
>>>>>>>> projects put these contrib modules in a "contrib" folder and keep
>>>>>>>> them managed as completely separate codebases. As it's not
>>>>>>>> actually a "module" necessary for Storm, there's an argument
>>>>>>>> there for doing it that way rather than via the multi-module
>>>>>>>> route.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>>>>>>>> <mpathira@umail.iu.edu> wrote:
>>>>>>>>
>>>>>>>> Hi Taylor,
>>>>>>>> 
>>>>>>>> I'm +1 for pulling these external libraries into the Apache codebase.
>>>>>>>> This will certainly benefit the Storm community. I would also like to
>>>>>>>> contribute to this process.
>>>>>>>> 
>>>>>>>> Thanks Milinda
>>>>>>>> 
>>>>>>>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>>>>>>>> <ptgoetz@gmail.com> wrote:
>>>>>>>>> A while back I opened STORM-206 [1] to capture ideas for
>>>>>>>>> pulling in "contrib" modules to the Apache codebase.
>>>>>>>>> 
>>>>>>>>> In the past, we had the storm-contrib github project [2] which
>>>>>>>>> subsequently got broken up into individual projects hosted on
>>>>>>>>> the stormprocessor github group [3] and elsewhere.
>>>>>>>>> 
>>>>>>>>> The problem with this approach is that in certain cases it led
>>>>>>>>> to code rot (modules not being updated in step with Storm's
>>>>>>>>> API), fragmentation (multiple similar modules with the same
>>>>>>>>> name), and confusion.
>>>>>>>>> 
>>>>>>>>> A good example of this is the storm-kafka module [4], since it
>>>>>>>>> is a widely used component. Because storm-contrib wasn't being
>>>>>>>>> tagged in github, a lot of users had trouble reconciling with
>>>>>>>>> which versions of storm it was compatible. Some users built off
>>>>>>>>> specific commit hashes, some forked, and a few even pushed
>>>>>>>>> custom builds to repositories such as clojars. With kafka 0.8
>>>>>>>>> now available, there are two main storm-kafka projects, the
>>>>>>>>> original (compatible with kafka 0.7) and an updated fork [5]
>>>>>>>>> (compatible with kafka 0.8).
>>>>>>>>> 
>>>>>>>>> My intention is not to find fault in any way, but rather to
>>>>>>>>> point out the resulting pain, and work toward a better
>>>>>>>>> solution.
>>>>>>>>> 
>>>>>>>>> I think it would be beneficial to the Storm user community to
>>>>>>>>> have certain commonly used modules like storm-kafka brought
>>>>>>>>> into the Apache Storm project. Another benefit worth
>>>>>>>>> considering is the licensing/legal oversight that the ASF
>>>>>>>>> provides, which is important to many users.
>>>>>>>>> 
>>>>>>>>> If this is something we want to do, then the big question
>>>>>>>>> becomes what sort of governance process needs to be established to
>>>>>>>>> ensure that such things are properly maintained.
>>>>>>>>> 
>>>>>>>>> Some random thoughts, questions, etc. that jump to mind
>>>>>>>>> include:
>>>>>>>>> 
>>>>>>>>> What to call these things: "contrib modules", "connectors",
>>>>>>>>> "integration modules", etc.?
>>>>>>>>> Build integration: I imagine they would be a multi-module submodule
>>>>>>>>> of the main maven build. Probably turned off by default and enabled
>>>>>>>>> by a maven profile.
>>>>>>>>> Governance: Have one or more committer volunteers responsible
>>>>>>>>> for maintenance, merging patches, etc.? Proposal process for
>>>>>>>>> pulling new modules?
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> I look forward to hearing others' opinions.
>>>>>>>>> 
>>>>>>>>> - Taylor
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>>>>>>>> [2] https://github.com/nathanmarz/storm-contrib
>>>>>>>>> [3] https://github.com/stormprocessor
>>>>>>>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>>>>>>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Twitter: @nathanmarz
>>>> http://nathanmarz.com
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> Twitter: @nathanmarz
>>> http://nathanmarz.com
>> 
>>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Ted Dunning <te...@gmail.com>.

Having a committer sign off on each addition plays a very large role at
Apache.  One of the key aspects of Apache software releases is that all of
the code is traceable back to the original contributor and there is a
logical chain that allows Apache to stand behind the licensing of the code.

This licensing and chain of evidence is a big part of what makes open
source palatable to risk-averse businesses.  It is really important to
maintain.

Storm had a very good record of doing this before becoming part of Apache,
which makes integration into Apache processes easier, but it is important
to hang on to that careful approach.





Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Exactly.

That’s why I proposed that anything that’s brought in require at least one committer to “sponsor” it:

>>>>> The idea behind having one or more Storm committers act as a
>>>>> "sponsor" is to make sure new additions are done carefully and with
>>>>> good reason. To add a new module, it would require committer/PPMC
>>>>> consensus, and assignment of one or more sponsors. Part of a
>>>>> sponsor's job would be to ensure that a module is maintained, which
>>>>> would require enough familiarity with the code to support it long
>>>>> term. If a new module was proposed, but no committers were willing
>>>>> to act as a sponsor, it would not be added.


Perhaps a README in the directory stating such a policy would make it clear that what’s in there is officially endorsed and maintained by the PPMC.

So we would have a process in place to prevent the “anything and everything” situation. We would only add things related to tech that is widely used in conjunction with Storm (Kafka, Cassandra, HDFS, etc.).


I’d like to start the work of pulling in storm-kafka; I just need a directory to put it in for now. Changing the name later is just one `mv` command.

- Taylor


On Mar 13, 2014, at 3:08 PM, Nathan Marz <na...@nathanmarz.com> wrote:

> We also don't want to create the impression that anything and everything
> belongs in the Storm project itself. storm-kafka is special because Kafka
> works so well with Storm and is so widely used. But if there was a folder
> called "connectors" or "adapters", people may think we're willing to pull
> in anything and everything.
> 
> +1 for putting storm-starter in an examples/ directory.
> 
> 
> On Thu, Mar 13, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> 
>> To clarify somewhat, the pull request for pulling in storm-starter [1]
>> puts it in an "examples" directory. And there are suggestions to pull
>> James' scheduler and testing examples in there as well. So there is a
>> distinction between examples and other things like storm-kafka.
>> 
>> What I'm proposing is a different yet-to-be-named directory that would be
>> home to things that integrate storm with other technologies.
>> 
>> In the storm-contrib README [2] the term used is "modules". On the Storm
>> website, we also use the term "adapter" [3].
>> 
>> - Taylor
>> 
>> [1] https://github.com/apache/incubator-storm/pull/44
>> [2]
>> https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about
>> [3]
>> http://storm.incubator.apache.org/documentation/Spout-implementations.html
>> 
>> 
>> On Mar 13, 2014, at 1:24 AM, David Miller <da...@m-square.com.au>
>> wrote:
>> 
>> 
>> What about both?
>> "connectors" for spouts/bolts/states that connect to other tech: storm-kafka,
>> storm-cassandra, etc.
>> "extras" for other things like storm-starter, storm-deploy, storm-puppet.
>> 
>> 
>> 
>> On 13 Mar 2014, at 3:57 pm, Nathan Marz <na...@nathanmarz.com> wrote:
>> 
>> I don't like either name tbh. Storm itself is already broken into modules
>> (storm-core, storm-netty, etc) and things like storm-starter and
>> storm-kafka are something different. I don't like "connectors" because
>> something like storm-starter is not a connector. Maybe we call them
>> "extras"?
>> 
>> I would say just to support 0.8.x of Kafka.
>> 
>> 
>> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>> 
>>> Incorporation of storm starter is underway.
>>> 
>>> I'd like to turn the attention to kafka, with the goal being to pull in
>>> kafka support that is maintained and will be known to be compatible with
>>> the current version of storm and specific version(s) of kafka.
>>> 
>>> I have the following questions for the community:
>>> 
>>> 1. What do we want to call additions like this? I'm leaning toward
>>> "modules" or "connectors".
>>> 
>>> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
>>> 0.8.x? From a release management perspective, the latter is preferable
>>> because the 0.7.x line artifacts are not in maven central. This makes
>>> building a real pain, and maintaining support for two versions won't be
>>> fun. Also, most of the people I have worked with are looking at 0.8.x for a
>>> variety of reasons, but I'm open to either way.
>>> 
>>> - Taylor
>>> 
>>> 
>>>> On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <michael+storm@michael-noll.com> wrote:
>>>> 
>>>> Thanks for starting this discussion, Taylor.
>>>> 
>>>> As a user of Storm (and a small-scale contributor to storm-starter) as
>>>> well as a user of Kafka, here are my $.02.
>>>> 
>>>> [Storm and Kafka]
>>>> First, I agree with Nathan that storm-kafka should be considered to be
>>>> brought in.  While various "integrate Storm with X" options exist,
>>>> basically everyone I have been talking to is using Kafka in
>>>> combination with Storm.  I'm sure this is not a representative sample
>>>> of Storm users, and of course one may or may not agree that Kafka is
>>>> important enough of a technology in Storm's ecosystem.  Still, I do
>>>> see the need to make sure Storm and Kafka do work together without
>>>> having to go through forks of forks on GitHub and spending days to
>>>> figure out how to get data from Kafka (0.8) into Storm.
>>>>   Speaking of Kafka spout implementations, please don't forget
>>>> https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
>>>> We've been quite happy with the former, so I'd suggest at least
>>>> considering both options here (maybe the two projects can even join
>>>> forces?).
>>>> 
>>>> [Storm examples, storm-starter]
>>>> Second, IMHO every open source project should have a "1-click starting
>>>> experience" for new users.  That's very much related to the project
>>>> principles of tools like LogStash [1] who say: "Community: If a newbie
>>>> has a bad time, it's a bug."  For this reason I personally would like
>>>> to see the equivalent of storm-starter being brought into the "core"
>>>> Storm project -- think of an examples/ sub-module.  If the level of
>>>> effort is deemed too high to e.g. maintain what's already in
>>>> storm-starter, then (say) reduce the scope and remove some of the
>>>> examples.  In any case I would personally like to see bundled
>>>> examples that are known to work with the latest version of Storm.
>>>> storm-starter is often used to show new users how to get started with
>>>> Storm (I used that approach in my Storm blog posts, for instance, and
>>>> others like Mesosphere.io are even using storm-starter for their
>>>> commercial offerings [2]).
>>>> 
>>>> [Have Storm up and running faster than you can brew an espresso]
>>>> Third, for the same reason (get people up and running in a few
>>>> minutes), I do like that other people in this thread have been
>>>> bringing up projects like storm-deploy.  For the same reason I have
>>>> open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
>>>> few days ago, and I'll soon open source another Vagrant/Puppet based
>>>> tool that provides you with 1-click local and remote deployments of
>>>> Storm and Kafka clusters.  That's way better IMHO than having to
>>>> follow long articles or blog posts to deploy your first cluster.  And
>>>> there are a number of other people that have been rolling their own
>>>> variants.  Now don't get me wrong -- I don't mention this to pitch any
>>>> of those tools.  My intention is to say that it would be greatly
>>>> helpful to have /something/ like this for Storm, for the same reason
>>>> that it's nice to have LocalCluster for unit testing.  I have been
>>>> demo'ing both Storm and Kafka by launching clusters with a simple
>>>> command line, which always gets people excited.  If they can then rely
>>>> on existing examples (see above) to also /run/ an analysis on "their"
>>>> cluster then they have a beautiful start.
>>>>   Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
>>>> VM cluster setup, too [4] so that people can run the Aurora tutorial
>>>> on their machines in a few minutes.
>>>> 
>>>> [Storm and YARN]
>>>> Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
>>>> would be nice.  It ties into being able to run LocalCluster as well as
>>>> to run Storm in local or remote VMs -- but now alongside your existing
>>>> Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
>>>> will surely be similarly attractive.
>>>> 
>>>> 
>>>> On a related note bringing the Storm docs up to speed with the quality
>>>> of the Storm code would also be great.  I have seen that since Storm
>>>> moved to Incubator several new sections have been added such as the
>>>> FAQ [5] (btw: nice!).
>>>> 
>>>> Similarly, there should be better examples and docs for users on how to
>>>> write unit tests for Storm.  Right now people seem to be cobbling
>>>> together their test code by figuring out how the 1-year old code in
>>>> [6] actually works, and copy-pasting other people's test code from
>>>> GitHub.
>>>> 
>>>> --
>>>> 
>>>> As I said above, these are my personal $.02.  I admit that my comments
>>>> go a bit beyond the original question of bringing in contrib modules
>>>> -- I think implicitly the discussion about the contrib modules also
>>>> means "what do you need to provide a better and more well-rounded
>>>> experience", i.e. the question whether to have batteries included or
>>>> not. (As you may suspect I'm leaning towards including at least the
>>>> most important batteries, though what's really "important" at the
>>>> project level is of course up for debate.)
>>>> 
>>>> On my side I'd be happy to help with those areas where I am able to
>>>> contribute, whether that's code and examples (like storm-starter) or
>>>> tutorials/docs (I already wrote e.g. [7] and [8]).
>>>> 
>>>> Again, thanks Taylor for starting this discussion.  No matter the
>>>> actual outcome I'm sure the state of the project will be improved.
>>>> 
>>>> Best,
>>>> Michael
>>>> 
>>>> 
>>>> 
>>>> [1] https://github.com/elasticsearch/logstash
>>>> [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
>>>> [3] https://github.com/miguno/puppet-storm
>>>> [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
>>>> [5] http://storm.incubator.apache.org/documentation/FAQ.html
>>>> [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
>>>> [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
>>>> [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
>>>> 
>>>> 
>>>> 
>>>>> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>>>>> Thanks for the feedback Bobby.
>>>>> 
>>>>> To clarify, I'm mainly talking about spout/bolt/trident state
>>>>> implementations that integrate storm with *Technology X*, where
>>>>> *Technology X* is not a fundamental part of storm.
>>>>> 
>>>>> Examples would be technologies that are part of or related to the
>>>>> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>>>>> Kafka, HDFS, HBase, Cassandra, etc.
>>>>> 
>>>>> The idea behind having one or more Storm committers act as a
>>>>> "sponsor" is to make sure new additions are done carefully and with
>>>>> good reason. To add a new module, it would require committer/PPMC
>>>>> consensus, and assignment of one or more sponsors. Part of a
>>>>> sponsor's job would be to ensure that a module is maintained, which
>>>>> would require enough familiarity with the code to support it long
>>>>> term. If a new module was proposed, but no committers were willing
>>>>> to act as a sponsor, it would not be added.
>>>>> 
>>>>> It would be the Committers'/PPMC's responsibility to make sure things
>>>>> didn't get out of hand, and to do something about it if they did.
>>>>> 
>>>>> Here's an old Hadoop JIRA thread [1] discussing the addition of
>>>>> Hive as a contrib module, similar to what happened with HBase as
>>>>> Bobby pointed out. Some interesting points are brought up. The
>>>>> difference here is that both HBase and Hive were pretty big
>>>>> codebases relative to Hadoop. With spout/bolt/state implementations
>>>>> I doubt we'd see anything along that scale.
>>>>> 
>>>>> - Taylor
>>>>> 
>>>>> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>>>>> 
>>>>> 
>>>>> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>>>>> 
>>>>>> I can see a lot of value in having a distribution of storm that
>>>>>> comes with batteries included, everything is tested together and
>>>>>> you know it works.  But I don't see much long term developer
>>>>>> benefit in building them all together.  If there is strong
>>>>>> coupling between storm and these external projects so that they
>>>>>> break when storm changes then we need to understand the coupling
>>>>>> and decide if we want to reduce that coupling by stabilizing
>>>>>> APIs, improving version numbering and release process, etc.; or
>>>>>> if the functionality is something that should be offered as a
>>>>>> base service in storm.
>>>>>> 
>>>>>> I can see politically the value of giving these other projects a
>>>>>> home in Apache, and making them sub-projects is the simplest
>>>>>> route to that. I'd love to have storm on yarn inside Apache.  I
>>>>>> just don't want to go overboard with it.  There was a time when
>>>>>> HBase was a "contrib" module under Hadoop along with a lot of
>>>>>> other things, and the Apache board came and told Hadoop to break
>>>>>> it up.
>>>>>> 
>>>>>> Bringing storm-kafka into storm does not sound like it will solve
>>>>>> much from a developer's perspective, because there is at least as
>>>>>> much coupling with kafka as there is with storm.  I can see how
>>>>>> it is a huge amount of overhead and pain to set up a new project
>>>>>> just for a few hundred lines of code; as such, I am in favor of
>>>>>> pulling in closely related projects, especially those that are
>>>>>> spouts and state implementations. I just want to be sure that we
>>>>>> do it carefully, with a good reason, and with enough people who
>>>>>> are familiar with the code to support it long term.
>>>>>> 
>>>>>> If it starts to look like we are pulling in too many projects
>>>>>> perhaps we should look at something more like the bigtop project
>>>>>> https://bigtop.apache.org/ which produces a tested distribution
>>>>>> of Hadoop with many different sub-projects included in it.
>>>>>> 
>>>>>> I am also a bit concerned about these sub-projects becoming
>>>>>> second class citizens, where we break something, but because the
>>>>>> build is off by default we don't know it.  I would prefer that
>>>>>> they are built and tested by default.  If the build and test time
>>>>>> starts to take too long, to me that means we need to start
>>>>>> wondering if we have too many contrib modules.
>>>>>> 
>>>>>> --Bobby
>>>>>> 
>>>>>> From: Brian Enochson <brian.enochson@gmail.com>
>>>>>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>>>>> Date: Tuesday, February 25, 2014 at 9:50 PM
>>>>>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>>>>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>>>>>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>>>>>> 
>>>>>> hi, I am in agreement with Taylor and believe I understand his
>>>>>> intent. An incredible tool/framework/application like Storm is
>>>>>> only enhanced and gains value from the number of well maintained
>>>>>> and vetted modules that can be used for integration and adding
>>>>>> further functionality. I am relatively new to the Storm community
>>>>>> but have spent quite some time reviewing contributing modules out
>>>>>> there, reviewing various duplicates and running into some version
>>>>>> incompatibilities. I understand the need to keep Storm itself
>>>>>> pure, but do think there needs to be some structure and
>>>>>> governance added to the contributing modules. Look at the benefit
>>>>>> a tool like npm brings to the node community. I like the idea of
>>>>>> sponsorship, vetting and a community vote.  I, as I'm sure many would
>>>>>> be, am willing to offer support and time to working through how
>>>>>> to set this up and helping with the implementation if it is
>>>>>> decided to pursue some solution. I hope these views are taken in
>>>>>> the spirit they are made, to make this incredible system even
>>>>>> better along with the surrounding eco-system.
>>>>>> 
>>>>>> Thanks, Brian
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>>>>>
>>>>>> Just to be clear (and play a little Devil's advocate :) ), I'm not
>>>>>> suggesting that whatever a "contrib" project/module/subproject
>>>>>> might become, be a clearinghouse for anything Storm-related.
>>>>>> 
>>>>>> I see it as something that is well-vetted by the Storm
>>>>>> community, subject to PPMC review, vote, etc. Entry would require
>>>>>> community review, PPMC review, and in some cases ASF IP
>>>>>> clearance/legal review. Anything added would require some level
>>>>>> of commitment from the PPMC/committers to provide some level of
>>>>>> support.
>>>>>> 
>>>>>> In other words, nothing "willy-nilly".
>>>>>> 
>>>>>> One option could be that any module added require (X > 0)  number
>>>>>> of committers to volunteer as "sponsor"s for the module, and
>>>>>> commit to maintaining it.
>>>>>> 
>>>>>> That being said, I don't see storm-kafka being any different
>>>>>> from anything else that provides integration points for Storm.
>>>>>> 
>>>>>> -Taylor
>>>>>> 
>>>>>> 
>>>>>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>>>>>> 
>>>>>> I'm only +1 for pulling in storm-kafka and updating it. Other
>>>>>> projects put these contrib modules in a "contrib" folder and keep
>>>>>> them managed as completely separate codebases. As it's not
>>>>>> actually a "module" necessary for Storm, there's an argument
>>>>>> there for doing it that way rather than via the multi-module
>>>>>> route.
>>>>>> 
>>>>>> 
>>>>>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
>>>>>>
>>>>>> Hi Taylor,
>>>>>>
>>>>>> I'm +1 for pulling these external libraries into the Apache codebase.
>>>>>> This will certainly benefit the Storm community. I would also like to
>>>>>> contribute to this process.
>>>>>>
>>>>>> Thanks,
>>>>>> Milinda
>>>>>> 
>>>>>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>>>>>> A while back I opened STORM-206 [1] to capture ideas for
>>>>>>> pulling in "contrib" modules to the Apache codebase.
>>>>>>> 
>>>>>>> In the past, we had the storm-contrib github project [2] which
>>>>>>> subsequently got broken up into individual projects hosted on
>>>>>>> the stormprocessor github group [3] and elsewhere.
>>>>>>> 
>>>>>>> The problem with this approach is that in certain cases it led
>>>>>>> to code rot (modules not being updated in step with Storm's
>>>>>>> API), fragmentation (multiple similar modules with the same
>>>>>>> name), and confusion.
>>>>>>> 
>>>>>>> A good example of this is the storm-kafka module [4], since it
>>>>>>> is a widely used component. Because storm-contrib wasn't being
>>>>>>> tagged in github, a lot of users had trouble reconciling with
>>>>>>> which versions of storm it was compatible. Some users built off
>>>>>>> specific commit hashes, some forked, and a few even pushed
>>>>>>> custom builds to repositories such as clojars. With kafka 0.8
>>>>>>> now available, there are two main storm-kafka projects, the
>>>>>>> original (compatible with kafka 0.7) and an updated fork [5]
>>>>>>> (compatible with kafka 0.8).
>>>>>>> 
>>>>>>> My intention is not to find fault in any way, but rather to
>>>>>>> point out the resulting pain, and work toward a better
>>>>>>> solution.
>>>>>>> 
>>>>>>> I think it would be beneficial to the Storm user community to
>>>>>>> have certain commonly used modules like storm-kafka brought
>>>>>>> into the Apache Storm project. Another benefit worth
>>>>>>> considering is the licensing/legal oversight that the ASF
>>>>>>> provides, which is important to many users.
>>>>>>> 
>>>>>>> If this is something we want to do, then the big question
>>>>>>> becomes what sort of governance process needs to be established to
>>>>>>> ensure that such things are properly maintained.
>>>>>>> 
>>>>>>> Some random thoughts, questions, etc. that jump to mind
>>>>>>> include:
>>>>>>> 
>>>>>>> What to call these things: "contrib modules", "connectors",
>>>>>>> "integration modules", etc.?
>>>>>>> Build integration: I imagine they would be a multi-module submodule
>>>>>>> of the main maven build. Probably turned off by default and enabled
>>>>>>> by a maven profile.
>>>>>>> Governance: Have one or more committer volunteers responsible for
>>>>>>> maintenance, merging patches, etc.?
>>>>>>> Proposal process for pulling new modules?
>>>>>>> 
>>>>>>> 
>>>>>>> I look forward to hearing others' opinions.
>>>>>>> 
>>>>>>> - Taylor
>>>>>>> 
>>>>>>> 
>>>>>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>>>>>> [2] https://github.com/nathanmarz/storm-contrib
>>>>>>> [3] https://github.com/stormprocessor
>>>>>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>>>>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>>>> 

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Nathan Marz <na...@nathanmarz.com>.

We also don't want to create the impression that anything and everything
belongs in the Storm project itself. storm-kafka is special because Kafka
works so well with Storm and is so widely used. But if there was a folder
called "connectors" or "adapters", people may think we're willing to pull
in anything and everything.

+1 for putting storm-starter in an examples/ directory.


On Thu, Mar 13, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> To clarify somewhat, the pull request for pulling in storm-starter [1]
> puts it in an "examples" directory. And there are suggestions to pull
>  James' scheduler and testing examples in there as well. So there is a
> distinction between examples and other things like storm-kafka.
>
> What I'm proposing is a different yet-to-be-named directory that would be
> home to things that integrate storm with other technologies.
>
> In the storm-contrib README [2] the term used is "modules". On the Storm
> website, we also use the term "adapter" [3].
>
> - Taylor
>
> [1] https://github.com/apache/incubator-storm/pull/44
> [2] https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about
> [3] http://storm.incubator.apache.org/documentation/Spout-implementations.html
>
>
> On Mar 13, 2014, at 1:24 AM, David Miller <da...@m-square.com.au>
> wrote:
>
>
> what about both ?
> connectors for spout/bolt/states that connect to other tech, storm-kafka,
> storm-cassandra, etc
> extras for other things like storm-starter, storm-deploy, storm-puppet
>
>
>
> On 13 Mar 2014, at 3:57 pm, Nathan Marz <na...@nathanmarz.com> wrote:
>
> I don't like either name tbh. Storm itself is already broken into modules
> (storm-core, storm-netty, etc) and things like storm-starter and
> storm-kafka are something different. I don't like "connectors" because
> something like storm-starter is not a connector. Maybe we call them
> "extras"?
>
> I would say just to support 0.8.x of Kafka.
>
>
> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>
>> Incorporation of storm starter is underway.
>>
>> I'd like to turn the attention to kafka, with the goal being to pull in
>> kafka support that is maintained and will be known to be compatible with
>> the current version of storm and specific version(s) of kafka.
>>
>> I have the following questions for the community:
>>
>> 1. What do we want to call additions like this? I'm leaning toward
>> "modules" or "connectors".
>>
>> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
>> 0.8.x? From a release management perspective, the latter is preferable
>> because the 0.7.x line artifacts are not in maven central. This makes
>> building a real pain, and maintaining support for two versions won't be
>> fun. Also, most of the people I have worked with are looking at 0.8.x for a
>> variety of reasons, but I'm open to either way.
>>
>> - Taylor
>>
>>
>> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <michael+storm@michael-noll.com> wrote:
>> >
>> > Thanks for starting this discussion, Taylor.
>> >
>> > As a user of Storm (and a small-scale contributor to storm-starter) as
>> > well as a user of Kafka, here are my $.02.
>> >
>> > [Storm and Kafka]
>> > First, I agree with Nathan that storm-kafka should be considered to be
>> > brought in.  While various "integrate Storm with X" options exist,
>> > basically everyone I have been talking to is using Kafka in
>> > combination with Storm.  I'm sure this is not a representative sample
>> > of Storm users, and of course one may or may not agree that Kafka is
>> > important enough of a technology in Storm's ecosystem.  Still, I do
>> > see the need to make sure Storm and Kafka do work together without
>> > having to go through forks of forks on GitHub and spending days to
>> > figure out how to get data from Kafka (0.8) into Storm.
>> >    Speaking of Kafka spout implementations, please don't forget
>> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
>> > We've been quite happy with the former, so I'd suggest at least
>> > considering both options here (maybe the two projects can even join
>> > forces?).
>> >
>> > [Storm examples, storm-starter]
>> > Second, IMHO every open source project should have a "1-click starting
>> > experience" for new users.  That's very much related to the project
>> > principles of tools like LogStash [1] who say: "Community: If a newbie
>> > has a bad time, it's a bug."  For this reason I personally would like
>> > to see the equivalent of storm-starter being brought into the "core"
>> > Storm project -- think of an examples/ sub-module.  If the level of
>> > effort is deemed too high to e.g. maintain what's already in
>> > storm-starter, then (say) reduce the scope and remove some of the
>> > examples.  In any case I would personally like to see bundled
>> > examples that are known to work with the latest version of Storm.
>> > storm-starter is often used to show new users how to get started with
>> > Storm (I used that approach in my Storm blog posts, for instance, and
>> > others like Mesosphere.io are even using storm-starter for their
>> > commercial offerings [2]).
>> >
>> > [Have Storm up and running faster than you can brew an espresso]
>> > Third, for the same reason (get people up and running in a few
>> > minutes), I do like that other people in this thread have been
>> > bringing up projects like storm-deploy.  For the same reason I have
>> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
>> > few days ago, and I'll soon open source another Vagrant/Puppet based
>> > tool that provides you with 1-click local and remote deployments of
>> > Storm and Kafka clusters.  That's way better IMHO than having to
>> > follow long articles or blog posts to deploy your first cluster.  And
>> > there are a number of other people that have been rolling their own
>> > variants.  Now don't get me wrong -- I don't mention this to pitch any
>> > of those tools.  My intention is to say that it would be greatly
>> > helpful to have /something/ like this for Storm, for the same reason
>> > that it's nice to have LocalCluster for unit testing.  I have been
>> > demo'ing both Storm and Kafka by launching clusters with a simple
>> > command line, which always gets people excited.  If they can then rely
>> > on existing examples (see above) to also /run/ an analysis on "their"
>> > cluster then they have a beautiful start.
>> >    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
>> > VM cluster setup, too [4] so that people can run the Aurora tutorial
>> > on their machines in a few minutes.
>> >
>> > [Storm and YARN]
>> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
>> > would be nice.  It ties into being able to run LocalCluster as well as
>> > to run Storm in local or remote VMs -- but now alongside your existing
>> > Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
>> > will surely be similarly attractive.
>> >
>> >
>> > On a related note bringing the Storm docs up to speed with the quality
>> > of the Storm code would also be great.  I have seen that since Storm
>> > moved to Incubator several new sections have been added such as the
>> > FAQ [5] (btw: nice!).
>> >
>> > Similarly, there should be better examples and docs for users on how to
>> > write unit tests for Storm.  Right now people seem to be cobbling
>> > together their test code by figuring out how the 1-year old code in
>> > [6] actually works, and copy-pasting other people's test code from
>> > GitHub.
>> >
>> > --
>> >
>> > As I said above, these are my personal $.02.  I admit that my comments
>> > go a bit beyond the original question of bringing in contrib modules
>> > -- I think implicitly the discussion about the contrib modules also
>> > means "what do you need to provide a better and more well-rounded
>> > experience", i.e. the question whether to have batteries included or
>> > not. (As you may suspect I'm leaning towards including at least the
>> > most important batteries, though what's really "important" at the
>> > project level is of course up for debate.)
>> >
>> > On my side I'd be happy to help with those areas where I am able to
>> > contribute, whether that's code and examples (like storm-starter) or
>> > tutorials/docs (I already wrote e.g. [7] and [8]).
>> >
>> > Again, thanks Taylor for starting this discussion.  No matter the
>> > actual outcome I'm sure the state of the project will be improved.
>> >
>> > Best,
>> > Michael
>> >
>> >
>> >
>> > [1] https://github.com/elasticsearch/logstash
>> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
>> > [3] https://github.com/miguno/puppet-storm
>> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
>> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
>> > [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
>> > [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
>> > [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
>> >
>> >
>> >
>> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> >> Thanks for the feedback Bobby.
>> >>
>> >> To clarify, I'm mainly talking about spout/bolt/trident state
>> >> implementations that integrate storm with *Technology X*, where
>> >> *Technology X* is not a fundamental part of storm.
>> >>
>> >> Examples would be technologies that are part of or related to the
>> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>> >> Kafka, HDFS, HBase, Cassandra, etc.
>> >>
>> >> The idea behind having one or more Storm committers act as a
>> >> "sponsor" is to make sure new additions are done carefully and with
>> >> good reason. To add a new module, it would require committer/PPMC
>> >> consensus, and assignment of one or more sponsors. Part of a
>> >> sponsor's job would be to ensure that a module is maintained, which
>> >> would require enough familiarity with the code to support it long
>> >> term. If a new module was proposed, but no committers were willing
>> >> to act as a sponsor, it would not be added.
>> >>
>> >> It would be the Committers'/PPMC's responsibility to make sure things
>> >> didn't get out of hand, and to do something about it if they did.
>> >>
>> >> Here's an old Hadoop JIRA thread [1] discussing the addition of
>> >> Hive as a contrib module, similar to what happened with HBase as
>> >> Bobby pointed out. Some interesting points are brought up. The
>> >> difference here is that both HBase and Hive were pretty big
>> >> codebases relative to Hadoop. With spout/bolt/state implementations
>> >> I doubt we'd see anything along that scale.
>> >>
>> >> - Taylor
>> >>
>> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> >>
>> >>
>> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>> >>
>> >>> I can see a lot of value in having a distribution of storm that
>> >>> comes with batteries included, everything is tested together and
>> >>> you know it works.  But I don't see much long term developer
>> >>> benefit in building them all together.  If there is strong
>> >>> coupling between storm and these external projects so that they
>> >>> break when storm changes then we need to understand the coupling
>> >>> and decide if we want to reduce that coupling by stabilizing
>> >>> APIs, improving version numbering and release process, etc.; or
>> >>> if the functionality is something that should be offered as a
>> >>> base service in storm.
>> >>>
>> >>> I can see politically the value of giving these other projects a
>> >>> home in Apache, and making them sub-projects is the simplest
>> >>> route to that. I'd love to have storm on yarn inside Apache.  I
>> >>> just don't want to go overboard with it.  There was a time when
>> >>> HBase was a "contrib" module under Hadoop along with a lot of
>> >>> other things, and the Apache board came and told Hadoop to break
>> >>> it up.
>> >>>
>> >>> Bringing storm-kafka into storm does not sound like it will solve
>> >>> much from a developer's perspective, because there is at least as
>> >>> much coupling with kafka as there is with storm.  I can see how
>> >>> it is a huge amount of overhead and pain to set up a new project
>> >>> just for a few hundred lines of code; as such, I am in favor of
>> >>> pulling in closely related projects, especially those that are
>> >>> spouts and state implementations. I just want to be sure that we
>> >>> do it carefully, with a good reason, and with enough people who
>> >>> are familiar with the code to support it long term.
>> >>>
>> >>> If it starts to look like we are pulling in too many projects
>> >>> perhaps we should look at something more like the bigtop project
>> >>> https://bigtop.apache.org/ which produces a tested distribution
>> >>> of Hadoop with many different sub-projects included in it.
>> >>>
>> >>> I am also a bit concerned about these sub-projects becoming
>> >>> second class citizens, where we break something, but because the
>> >>> build is off by default we don't know it.  I would prefer that
>> >>> they are built and tested by default.  If the build and test time
>> >>> starts to take too long, to me that means we need to start
>> >>> wondering if we have too many contrib modules.
>> >>>
>> >>> --Bobby
>> >>>
>> >>> From: Brian Enochson <brian.enochson@gmail.com>
>> >>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
>> >>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>> >>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> >>>
>> >>> hi, I am in agreement with Taylor and believe I understand his
>> >>> intent. An incredible tool/framework/application like Storm is
>> >>> only enhanced and gains value from the number of well maintained
>> >>> and vetted modules that can be used for integration and adding
>> >>> further functionality. I am relatively new to the Storm community
>> >>> but have spent quite some time reviewing contributing modules out
>> >>> there, reviewing various duplicates and running into some version
>> >>> incompatibilities. I understand the need to keep Storm itself
>> >>> pure, but do think there needs to be some structure and
>> >>> governance added to the contributing modules. Look at the benefit
>> >>> a tool like npm brings to the node community. I like the idea of
>> >>> sponsorship, vetting and a community vote.  I, as I am sure many would
>> >>> be, am willing to offer support and time to working through how
>> >>> to set this up and helping with the implementation if it is
>> >>> decided to pursue some solution. I hope these views are taken in
>> >>> the spirit in which they are made, to make this incredible system even
>> >>> better along with the surrounding eco-system.
>> >>>
>> >>> Thanks, Brian
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com> wrote:
>> >>>
>> >>> Just to be clear (and play a little Devil's advocate :) ), I'm not
>> >>> suggesting that whatever a "contrib" project/module/subproject
>> >>> might become, be a clearinghouse for anything Storm-related.
>> >>>
>> >>> I see it as something that is well-vetted by the Storm
>> >>> community, subject to PPMC review, vote, etc. Entry would require
>> >>> community review, PPMC review, and in some cases ASF IP
>> >>> clearance/legal review. Anything added would require some level
>> >>> of commitment from the PPMC/committers to provide some level of
>> >>> support.
>> >>>
>> >>> In other words, nothing "willy-nilly".
>> >>>
>> >>> One option could be that any module added require (X > 0)  number
>> >>> of committers to volunteer as "sponsor"s for the module, and
>> >>> commit to maintaining it.
>> >>>
>> >>> That being said, I don't see storm-kafka being any different
>> >>> from anything else that provides integration points for Storm.
>> >>>
>> >>> -Taylor
>> >>>
>> >>>
>> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com
>> >>> <ma...@nathanmarz.com>>
>> >>> wrote:
>> >>>
>> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> >>> projects put these contrib modules in a "contrib" folder and keep
>> >>> them managed as completely separate codebases. As it's not
>> >>> actually a "module" necessary for Storm, there's an argument
>> >>> there for doing it that way rather than via the multi-module
>> >>> route.
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>> >>> <mpathira@umail.iu.edu> wrote:
>> >>>
>> >>> Hi Taylor,
>> >>>
>> >>> I'm +1 for pulling these external libraries into the Apache codebase.
>> >>> This will certainly benefit the Storm community. I would also like to
>> >>> contribute to this process.
>> >>>
>> >>> Thanks Milinda
>> >>>
>> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com
>> >>> <ma...@gmail.com>> wrote:
>> >>>> A while back I opened STORM-206 [1] to capture ideas for
>> >>>> pulling in "contrib" modules to the Apache codebase.
>> >>>>
>> >>>> In the past, we had the storm-contrib github project [2] which
>> >>>> subsequently got broken up into individual projects hosted on
>> >>>> the stormprocessor github group [3] and elsewhere.
>> >>>>
>> >>>> The problem with this approach is that in certain cases it led
>> >>>> to code rot (modules not being updated in step with Storm's
>> >>>> API), fragmentation (multiple similar modules with the same
>> >>>> name), and confusion.
>> >>>>
>> >>>> A good example of this is the storm-kafka module [4], since it
>> >>>> is a widely used component. Because storm-contrib wasn't being
>> >>>> tagged in github, a lot of users had trouble reconciling with
>> >>>> which versions of storm it was compatible. Some users built off
>> >>>> specific commit hashes, some forked, and a few even pushed
>> >>>> custom builds to repositories such as clojars. With kafka 0.8
>> >>>> now available, there are two main storm-kafka projects, the
>> >>>> original (compatible with kafka 0.7) and an updated fork [5]
>> >>>> (compatible with kafka 0.8).
>> >>>>
>> >>>> My intention is not to find fault in any way, but rather to
>> >>>> point out the resulting pain, and work toward a better
>> >>>> solution.
>> >>>>
>> >>>> I think it would be beneficial to the Storm user community to
>> >>>> have certain commonly used modules like storm-kafka brought
>> >>>> into the Apache Storm project. Another benefit worth
>> >>>> considering is the licensing/legal oversight that the ASF
>> >>>> provides, which is important to many users.
>> >>>>
>> >>>> If this is something we want to do, then the big question
>> >>>> becomes what sort of governance process needs to be established to
>> >>>> ensure that such things are properly maintained.
>> >>>>
>> >>>> Some random thoughts, questions, etc. that jump to mind
>> >>>> include:
>> >>>>
>> >>>> What to call these things: "contrib modules", "connectors",
>> >>>> "integration modules", etc.?
>> >>>> Build integration: I imagine they would be a multi-module
>> >>>> submodule of the main maven build. Probably turned off by default
>> >>>> and enabled by a maven profile.
>> >>>> Governance: Have one or more committer volunteers responsible
>> >>>> for maintenance, merging patches, etc.?
>> >>>> Proposal process for pulling new modules?
>> >>>>
>> >>>>
>> >>>> I look forward to hearing others' opinions.
>> >>>>
>> >>>> - Taylor
>> >>>>
>> >>>>
>> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
>> >>>> [2] https://github.com/nathanmarz/storm-contrib
>> >>>> [3] https://github.com/stormprocessor
>> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>> >
>>
>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>
>
>
>


-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Nathan Marz <na...@nathanmarz.com>.

We also don't want to create the impression that anything and everything
belongs in the Storm project itself. storm-kafka is special because Kafka
works so well with Storm and is so widely used. But if there was a folder
called "connectors" or "adapters", people may think we're willing to pull
in anything and everything.

+1 for putting storm-starter in an examples/ directory.


On Thu, Mar 13, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> To clarify somewhat, the pull request for pulling in storm-starter [1]
> puts it in an "examples" directory. And there are suggestions to pull
> James' scheduler and testing examples in there as well. So there is a
> distinction between examples and other things like storm-kafka.
>
> What I'm proposing is a different yet-to-be-named directory that would be
> home to things that integrate storm with other technologies.
>
> In the storm-contrib README [2] the term used is "modules". On the Storm
> website, we also use the term "adapter" [3].
>
> - Taylor
>
> [1] https://github.com/apache/incubator-storm/pull/44
> [2] https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about
> [3] http://storm.incubator.apache.org/documentation/Spout-implementations.html
>
>
> On Mar 13, 2014, at 1:24 AM, David Miller <da...@m-square.com.au>
> wrote:
>
>
> what about both ?
> connectors for spout/bolt/states that connect to other tech, storm-kafka,
> storm-cassandra, etc
> extras for other things like storm-starter, storm-deploy, storm-puppet
>
>
>
> On 13 Mar 2014, at 3:57 pm, Nathan Marz <na...@nathanmarz.com> wrote:
>
> I don't like either name tbh. Storm itself is already broken into modules
> (storm-core, storm-netty, etc) and things like storm-starter and
> storm-kafka are something different. I don't like "connectors" because
> something like storm-starter is not a connector. Maybe we call them
> "extras"?
>
> I would say just to support 0.8.x of Kafka.
>
>
> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>
>> Incorporation of storm starter is underway.
>>
>> I'd like to turn the attention to kafka, with the goal being to pull in
>> kafka support that is maintained and will be known to be compatible with
>> the current version of storm and specific version(s) of kafka.
>>
>> I have the following questions for the community:
>>
>> 1. What do we want to call additions like this? I'm leaning toward
>> "modules" or "connectors".
>>
>> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
>> 0.8.x? From a release management perspective, the latter is preferable
>> because the 0.7.x line artifacts are not in maven central. This makes
>> building a real pain, and maintaining support for two versions won't be
>> fun. Also, most of the people I have worked with are looking at 0.8.x for a
>> variety of reasons, but I'm open to either way.
>>
>> - Taylor
>>
>>
>> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <michael+storm@michael-noll.com> wrote:
>> >
>> > Thanks for starting this discussion, Taylor.
>> >
>> > As a user of Storm (and a small-scale contributor to storm-starter) as
>> > well as a user of Kafka, here are my $.02.
>> >
>> > [Storm and Kafka]
>> > First, I agree with Nathan that storm-kafka should be considered to be
>> > brought in.  While various "integrate Storm with X" options exist,
>> > basically everyone I have been talking to is using Kafka in
>> > combination with Storm.  I'm sure this is not a representative sample
>> > of Storm users, and of course one may or may not agree that Kafka is
>> > important enough of a technology in Storm's ecosystem.  Still, I do
>> > see the need to make sure Storm and Kafka do work together without
>> > having to go through forks of forks on GitHub and spending days to
>> > figure out how to get data from Kafka (0.8) into Storm.
>> >    Speaking of Kafka spout implementations, please don't forget
>> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
>> > We've been quite happy with the former, so I'd suggest to at least
>> > consider both options here (maybe the two projects can even join
>> > forces?).
>> >
>> > [Storm examples, storm-starter]
>> > Second, IMHO every open source project should have a "1-click starting
>> > experience" for new users.  That's very much related to the project
>> > principles of tools like LogStash [1] who say: "Community: If a newbie
>> > has a bad time, it's a bug."  For this reason I personally would like
>> > to see the equivalent of storm-starter being brought into the "core"
>> > Storm project -- think of an examples/ sub-module.  If the level of
>> > effort is deemed too high to e.g. maintain what's already in
>> > storm-starter, then (say) reduce the scope and remove some of the
>> > examples.  In any case I personally would like to see bundled
>> > examples that are known to work with the latest version of Storm.
>> > storm-starter is often used to show new users how to get started with
>> > Storm (I used that approach in my Storm blog posts, for instance, and
>> > others like Mesosphere.io are even using storm-starter for their
>> > commercial offerings [2]).
>> >
>> > [Have Storm up and running faster than you can brew an espresso]
>> > Third, for the same reason (get people up and running in a few
>> > minutes), I do like that other people in this thread have been
>> > bringing up projects like storm-deploy.  For the same reason I have
>> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
>> > few days ago, and I'll soon open source another Vagrant/Puppet based
>> > tool that provides you with 1-click local and remote deployments of
>> > Storm and Kafka clusters.  That's way better IMHO than having to
>> > follow long articles or blog posts to deploy your first cluster.  And
>> > there are a number of other people that have been rolling their own
>> > variants.  Now don't get me wrong -- I don't mention this to pitch any
>> > of those tools.  My intention is to say that it would be greatly
>> > helpful to have /something/ like this for Storm, for the same reason
>> > that it's nice to have LocalCluster for unit testing.  I have been
>> > demo'ing both Storm and Kafka by launching clusters with a simple
>> > command line, which always gets people excited.  If they can then rely
>> > on existing examples (see above) to also /run/ an analysis on "their"
>> > cluster then they have a beautiful start.
>> >    Oh, and btw:  Apache Aurora (with Mesos) have such a Vagrant-based
>> > VM cluster setup, too [4] so that people can run the Aurora tutorial
>> > on their machines in a few minutes.
>> >
>> > [Storm and YARN]
>> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
>> > would be nice.  It ties into being able to run LocalCluster as well as
>> > to run Storm in local or remote VMs -- but now alongside your existing
>> > Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
>> > will surely be similarly attractive.
>> >
>> >
>> > On a related note bringing the Storm docs up to speed with the quality
>> > of the Storm code would also be great.  I have seen that since Storm
>> > moved to Incubator several new sections have been added such as the
>> > FAQ [5] (btw: nice!).
>> >
>> > Similarly, there should be better examples and docs for users how to
>> > write unit tests for Storm.  Right now people seem to be cobbling
>> > together their test code by figuring out how the 1-year old code in
>> > [6] actually works, and copy-pasting other people's test code from
>> > GitHub.
>> >
>> > --
>> >
>> > As I said above, these are my personal $.02.  I admit that my comments
>> > go a bit beyond the original question of bringing in contrib modules
>> > -- I think implicitly the discussion about the contrib modules also
>> > means "what do you need to provide a better and more well-rounded
>> > experience", i.e. the question of whether to have batteries included or
>> > not. (As you may suspect I'm leaning towards including at least the
>> > most important batteries, though what's really "important" at the
>> > project level is of course up for debate.)
>> >
>> > On my side I'd be happy to help with those areas where I am able to
>> > contribute, whether that's code and examples (like storm-starter) or
>> > tutorials/docs (I already wrote e.g. [7] and [8]).
>> >
>> > Again, thanks Taylor for starting this discussion.  No matter the
>> > actual outcome I'm sure the state of the project will be improved.
>> >
>> > Best,
>> > Michael
>> >
>> >
>> >
>> > [1] https://github.com/elasticsearch/logstash
>> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
>> > [3] https://github.com/miguno/puppet-storm
>> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
>> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
>> > [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
>> > [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
>> > [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
>> >
>> >
>> >
>> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> >> Thanks for the feedback Bobby.
>> >>
>> >> To clarify, I'm mainly talking about spout/bolt/trident state
>> >> implementations that integrate storm with *Technology X*, where
>> >> *Technology X* is not a fundamental part of storm.
>> >>
>> >> Examples would be technologies that are part of or related to the
>> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>> >> Kafka, HDFS, HBase, Cassandra, etc.
>> >>
>> >> The idea behind having one or more Storm committers act as a
>> >> "sponsor" is to make sure new additions are done carefully and with
>> >> good reason. To add a new module, it would require committer/PPMC
>> >> consensus, and assignment of one or more sponsors. Part of a
>> >> sponsor's job would be to ensure that a module is maintained, which
>> >> would require enough familiarity with the code to support it long
>> >> term. If a new module was proposed, but no committers were willing
>> >> to act as a sponsor, it would not be added.
>> >>
>> >> It would be the Committers'/PPMC's responsibility to make sure things
>> >> didn't get out of hand, and to do something about it if they did.
>> >>
>> >> Here's an old Hadoop JIRA thread [1] discussing the addition of
>> >> Hive as a contrib module, similar to what happened with HBase as
>> >> Bobby pointed out. Some interesting points are brought up. The
>> >> difference here is that both HBase and Hive were pretty big
>> >> codebases relative to Hadoop. With spout/bolt/state implementations
>> >> I doubt we'd see anything along that scale.
>> >>
>> >> - Taylor
>> >>
>> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> >>
>> >>
>> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com
>> >> <ma...@yahoo-inc.com>> wrote:
>> >>
>> >>> I can see a lot of value in having a distribution of storm that
>> >>> comes with batteries included, everything is tested together and
>> >>> you know it works.  But I don't see much long term developer
>> >>> benefit in building them all together.  If there is strong
>> >>> coupling between storm and these external projects so that they
>> >>> break when storm changes then we need to understand the coupling
>> >>> and decide if we want to reduce that coupling by stabilizing
>> >>> APIs, improving version numbering and release process, etc.; or
>> >>> if the functionality is something that should be offered as a
>> >>> base service in storm.
>> >>>
>> >>> I can see politically the value of giving these other projects a
>> >>> home in Apache, and making them sub-projects is the simplest
>> >>> route to that. I'd love to have storm on yarn inside Apache.  I
>> >>> just don't want to go overboard with it.  There was a time when
>> >>> HBase was a "contrib" module under Hadoop along with a lot of
>> >>> other things, and the Apache board came and told Hadoop to break
>> >>> it up.
>> >>>
>> >>> Bringing storm-kafka into storm does not sound like it will solve
>> >>> much from a developer's perspective, because there is at least as
>> >>> much coupling with kafka as there is with storm.  I can see how
>> >>> it is a huge amount of overhead and pain to set up a new project
>> >>> just for a few hundred lines of code, as such I am in favor of
>> >>> pulling in closely related projects, especially those that are
>> >>> spouts and state implementations. I just want to be sure that we
>> >>> do it carefully, with a good reason, and with enough people who
>> >>> are familiar with the code to support it long term.
>> >>>
>> >>> If it starts to look like we are pulling in too many projects
>> >>> perhaps we should look at something more like the bigtop project
>> >>> https://bigtop.apache.org/ which produces a tested distribution
>> >>> of Hadoop with many different sub-projects included in it.
>> >>>
>> >>> I am also a bit concerned about these sub-projects becoming
>> >>> second class citizens, where we break something, but because the
>> >>> build is off by default we don't know it.  I would prefer that
>> >>> they are built and tested by default.  If the build and test time
>> >>> starts to take too long, to me that means we need to start
>> >>> wondering if we have too many contrib modules.
>> >>>
>> >>> --Bobby
>> >>>
>> >>> From: Brian Enochson <brian.enochson@gmail.com>
>> >>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
>> >>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>> >>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> >>>
>> >>> hi, I am in agreement with Taylor and believe I understand his
>> >>> intent. An incredible tool/framework/application like Storm is
>> >>> only enhanced and gains value from the number of well maintained
>> >>> and vetted modules that can be used for integration and adding
>> >>> further functionality. I am relatively new to the Storm community
>> >>> but have spent quite some time reviewing contributing modules out
>> >>> there, reviewing various duplicates and running into some version
>> >>> incompatibilities. I understand the need to keep Storm itself
>> >>> pure, but do think there needs to be some structure and
>> >>> governance added to the contributing modules. Look at the benefit
>> >>> a tool like npm brings to the node community. I like the idea of
>> >>> sponsorship, vetting and a community vote.  I, as I am sure many would
>> >>> be, am willing to offer support and time to working through how
>> >>> to set this up and helping with the implementation if it is
>> >>> decided to pursue some solution. I hope these views are taken in
>> >>> the spirit in which they are made, to make this incredible system even
>> >>> better along with the surrounding eco-system.
>> >>>
>> >>> Thanks, Brian
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com> wrote:
>> >>>
>> >>> Just to be clear (and play a little Devil's advocate :) ), I'm not
>> >>> suggesting that whatever a "contrib" project/module/subproject
>> >>> might become, be a clearinghouse for anything Storm-related.
>> >>>
>> >>> I see it as something that is well-vetted by the Storm
>> >>> community, subject to PPMC review, vote, etc. Entry would require
>> >>> community review, PPMC review, and in some cases ASF IP
>> >>> clearance/legal review. Anything added would require some level
>> >>> of commitment from the PPMC/committers to provide some level of
>> >>> support.
>> >>>
>> >>> In other words, nothing "willy-nilly".
>> >>>
>> >>> One option could be that any module added require (X > 0)  number
>> >>> of committers to volunteer as "sponsor"s for the module, and
>> >>> commit to maintaining it.
>> >>>
>> >>> That being said, I don't see storm-kafka being any different
>> >>> from anything else that provides integration points for Storm.
>> >>>
>> >>> -Taylor
>> >>>
>> >>>
>> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com
>> >>> <ma...@nathanmarz.com>>
>> >>> wrote:
>> >>>
>> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> >>> projects put these contrib modules in a "contrib" folder and keep
>> >>> them managed as completely separate codebases. As it's not
>> >>> actually a "module" necessary for Storm, there's an argument
>> >>> there for doing it that way rather than via the multi-module
>> >>> route.
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>> >>> <mpathira@umail.iu.edu> wrote:
>> >>>
>> >>> Hi Taylor,
>> >>>
>> >>> I'm +1 for pulling these external libraries into the Apache codebase.
>> >>> This will certainly benefit the Storm community. I would also like to
>> >>> contribute to this process.
>> >>>
>> >>> Thanks Milinda
>> >>>
>> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com
>> >>> <ma...@gmail.com>> wrote:
>> >>>> A while back I opened STORM-206 [1] to capture ideas for
>> >>>> pulling in "contrib" modules to the Apache codebase.
>> >>>>
>> >>>> In the past, we had the storm-contrib github project [2] which
>> >>>> subsequently got broken up into individual projects hosted on
>> >>>> the stormprocessor github group [3] and elsewhere.
>> >>>>
>> >>>> The problem with this approach is that in certain cases it led
>> >>>> to code rot (modules not being updated in step with Storm's
>> >>>> API), fragmentation (multiple similar modules with the same
>> >>>> name), and confusion.
>> >>>>
>> >>>> A good example of this is the storm-kafka module [4], since it
>> >>>> is a widely used component. Because storm-contrib wasn't being
>> >>>> tagged in github, a lot of users had trouble reconciling with
>> >>>> which versions of storm it was compatible. Some users built off
>> >>>> specific commit hashes, some forked, and a few even pushed
>> >>>> custom builds to repositories such as clojars. With kafka 0.8
>> >>>> now available, there are two main storm-kafka projects, the
>> >>>> original (compatible with kafka 0.7) and an updated fork [5]
>> >>>> (compatible with kafka 0.8).
>> >>>>
>> >>>> My intention is not to find fault in any way, but rather to
>> >>>> point out the resulting pain, and work toward a better
>> >>>> solution.
>> >>>>
>> >>>> I think it would be beneficial to the Storm user community to
>> >>>> have certain commonly used modules like storm-kafka brought
>> >>>> into the Apache Storm project. Another benefit worth
>> >>>> considering is the licensing/legal oversight that the ASF
>> >>>> provides, which is important to many users.
>> >>>>
>> >>>> If this is something we want to do, then the big question
>> >>>> becomes what sort of governance process needs to be established to
>> >>>> ensure that such things are properly maintained.
>> >>>>
>> >>>> Some random thoughts, questions, etc. that jump to mind
>> >>>> include:
>> >>>>
>> >>>> What to call these things: "contrib modules", "connectors",
>> >>>> "integration modules", etc.?
>> >>>> Build integration: I imagine they would be a multi-module
>> >>>> submodule of the main maven build. Probably turned off by default
>> >>>> and enabled by a maven profile.
>> >>>> Governance: Have one or more committer volunteers responsible
>> >>>> for maintenance, merging patches, etc.?
>> >>>> Proposal process for pulling new modules?
>> >>>>
>> >>>>
>> >>>> I look forward to hearing others' opinions.
>> >>>>
>> >>>> - Taylor
>> >>>>
>> >>>>
>> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
>> >>>> [2] https://github.com/nathanmarz/storm-contrib
>> >>>> [3] https://github.com/stormprocessor
>> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>> >
>>
>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>
>
>
>


-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

To clarify somewhat, the pull request for pulling in storm-starter [1] puts it in an “examples” directory. And there are suggestions to pull James’ scheduler and testing examples in there as well. So there is a distinction between examples and other things like storm-kafka.

What I’m proposing is a different yet-to-be-named directory that would be home to things that integrate storm with other technologies.

In the storm-contrib README [2] the term used is “modules”. On the Storm website, we also use the term “adapter” [3].

- Taylor

[1] https://github.com/apache/incubator-storm/pull/44
[2] https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about
[3] http://storm.incubator.apache.org/documentation/Spout-implementations.html


On Mar 13, 2014, at 1:24 AM, David Miller <da...@m-square.com.au> wrote:

> 
> what about both ?
> connectors for spout/bolt/states that connect to other tech, storm-kafka, storm-cassandra, etc
> extras for other things like storm-starter, storm-deploy, storm-puppet
> 
> 
> 
> On 13 Mar 2014, at 3:57 pm, Nathan Marz <na...@nathanmarz.com> wrote:
> 
>> I don't like either name tbh. Storm itself is already broken into modules (storm-core, storm-netty, etc) and things like storm-starter and storm-kafka are something different. I don't like "connectors" because something like storm-starter is not a connector. Maybe we call them "extras"?
>> 
>> I would say just to support 0.8.x of Kafka.
>> 
>> 
>> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>> Incorporation of storm starter is underway.
>> 
>> I'd like to turn the attention to kafka, with the goal being to pull in kafka support that is maintained and will be known to be compatible with the current version of storm and specific version(s) of kafka.
>> 
>> I have the following questions for the community:
>> 
>> 1. What do we want to call additions like this? I'm leaning toward "modules" or "connectors".
>> 
>> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just 0.8.x? From a release management perspective, the latter is preferable because the 0.7.x line artifacts are not in maven central. This makes building a real pain, and maintaining support for two versions won't be fun. Also, most of the people I have worked with are looking at 0.8.x for a variety of reasons, but I'm open to either way.
>> 
>> - Taylor
>> 
>> 
>> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <mi...@michael-noll.com> wrote:
>> >
>> > Thanks for starting this discussion, Taylor.
>> >
>> > As a user of Storm (and a small-scale contributor to storm-starter) as
>> > well as a user of Kafka, here are my $.02.
>> >
>> > [Storm and Kafka]
>> > First, I agree with Nathan that storm-kafka should be considered to be
>> > brought in.  While various "integrate Storm with X" options exist,
>> > basically everyone I have been talking to is using Kafka in
>> > combination with Storm.  I'm sure this is not a representative sample
>> > of Storm users, and of course one may or may not agree that Kafka is
>> > important enough of a technology in Storm's ecosystem.  Still, I do
>> > see the need to make sure Storm and Kafka do work together without
>> > having to go through forks of forks on GitHub and spending days to
>> > figure out how to get data from Kafka (0.8) into Storm.
>> >    Speaking of Kafka spout implementations, please don't forget
>> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
>> > We've been quite happy with the former, so I'd suggest to at least
>> > consider both options here (maybe the two projects can even join forces?).
>> >
>> > [Storm examples, storm-starter]
>> > Second, IMHO every open source project should have a "1-click starting
>> > experience" for new users.  That's very much related to the project
>> > principles of tools like LogStash [1] who say: "Community: If a newbie
>> > has a bad time, it's a bug."  For this reason I personally would like
>> > to see the equivalent of storm-starter being brought into the "core"
>> > Storm project -- think of an examples/ sub-module.  If the level of
>> > effort is deemed too high to e.g. maintain what's already in
>> > storm-starter, then (say) reduce the scope and remove some of the
>> > examples.  In any case I personally would like to see bundled
>> > examples that are known to work with the latest version of Storm.
>> > storm-starter is often used to show new users how to get started with
>> > Storm (I used that approach in my Storm blog posts, for instance, and
>> > others like Mesosphere.io are even using storm-starter for their
>> > commercial offerings [2]).
>> >
>> > [Have Storm up and running faster than you can brew an espresso]
>> > Third, for the same reason (get people up and running in a few
>> > minutes), I do like that other people in this thread have been
>> > bringing up projects like storm-deploy.  For the same reason I have
>> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
>> > few days ago, and I'll soon open source another Vagrant/Puppet based
>> > tool that provides you with 1-click local and remote deployments of
>> > Storm and Kafka clusters.  That's way better IMHO than having to
>> > follow long articles or blog posts to deploy your first cluster.  And
>> > there are a number of other people that have been rolling their own
>> > variants.  Now don't get me wrong -- I don't mention this to pitch any
>> > of those tools.  My intention is to say that it would be greatly
>> > helpful to have /something/ like this for Storm, for the same reason
>> > that it's nice to have LocalCluster for unit testing.  I have been
>> > demo'ing both Storm and Kafka by launching clusters with a simple
>> > command line, which always gets people excited.  If they can then rely
>> > on existing examples (see above) to also /run/ an analysis on "their"
>> > cluster then they have a beautiful start.
>> >    Oh, and btw:  Apache Aurora (with Mesos) have such a Vagrant-based
>> > VM cluster setup, too [4] so that people can run the Aurora tutorial
>> > on their machines in a few minutes.
>> >
>> > [Storm and YARN]
>> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
>> > would be nice.  It ties into being able to run LocalCluster as well as
>> > to run Storm in local or remote VMs -- but now alongside your existing
>> > Hadoop/YARN infrastructure.  For those preferring Mesos Storm-on-Mesos
>> > will surely be similarly attractive.
>> >
>> >
>> > On a related note bringing the Storm docs up to speed with the quality
>> > of the Storm code would also be great.  I have seen that since Storm
>> > moved to Incubator several new sections have been added such as the
>> > FAQ [5] (btw: nice!).
>> >
>> > Similarly, there should be better examples and docs for users how to
>> > write unit tests for Storm.  Right now people seem to be cobbling
>> > together their test code by figuring out how the 1-year old code in
>> > [6] actually works, and copy-pasting other people's test code from GitHub.
>> >
>> > --
>> >
>> > As I said above, these are my personal $.02.  I admit that my comments
>> > go a bit beyond the original question of bringing in contrib modules
>> > -- I think implicitly the discussion about the contrib modules also
>> > means "what do you need to provide a better and more well-rounded
>> > experience", i.e. the question whether to have batteries included or
>> > not. (As you may suspect I'm leaning towards including at least the
>> > most important batteries, though what's really "important" at the
>> > project level is of course up to debate.)
>> >
>> > On my side I'd be happy to help with those areas where I am able to
>> > contribute, whether that's code and examples (like storm-starter) or
>> > tutorials/docs (I already wrote e.g. [7] and [8]).
>> >
>> > Again, thanks Taylor for starting this discussion.  No matter the
>> > actual outcome I'm sure the state of the project will be improved.
>> >
>> > Best,
>> > Michael
>> >
>> >
>> >
>> > [1] https://github.com/elasticsearch/logstash
>> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
>> > [3] https://github.com/miguno/puppet-storm
>> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
>> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
>> > [6]
>> > https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
>> > [7]
>> > https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
>> > [8]
>> > http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
>> >
>> >
>> >
>> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> >> Thanks for the feedback Bobby.
>> >>
>> >> To clarify, I’m mainly talking about spout/bolt/trident state
>> >> implementations that integrate storm with *Technology X*, where
>> >> *Technology X* is not a fundamental part of storm.
>> >>
>> >> Examples would be technologies that are part of or related to the
>> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>> >> Kafka, HDFS, HBase, Cassandra, etc.
>> >>
>> >> The idea behind having one or more Storm committers act as a
>> >> “sponsor” is to make sure new additions are done carefully and with
>> >> good reason. To add a new module, it would require committer/PPMC
>> >> consensus, and assignment of one or more sponsors. Part of a
>> >> sponsor’s job would be to ensure that a module is maintained, which
>> >> would require enough familiarity with the code to support it long
>> >> term. If a new module was proposed, but no committers were willing
>> >> to act as a sponsor, it would not be added.
>> >>
>> >> It would be the Committers’/PPMC’s responsibility to make sure things
>> >> didn’t get out of hand, and to do something about it if they did.
>> >>
>> >> Here’s an old Hadoop JIRA thread [1] discussing the addition of
>> >> Hive as a contrib module, similar to what happened with HBase as
>> >> Bobby pointed out. Some interesting points are brought up. The
>> >> difference here is that both HBase and Hive were pretty big
>> >> codebases relative to Hadoop. With spout/bolt/state implementations
>> >> I doubt we’d see anything along that scale.
>> >>
>> >> - Taylor
>> >>
>> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> >>
>> >>
>> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com
>> >> <ma...@yahoo-inc.com>> wrote:
>> >>
>> >>> I can see a lot of value in having a distribution of storm that
>> >>> comes with batteries included, everything is tested together and
>> >>> you know it works.  But I don’t see much long term developer
>> >>> benefit in building them all together.  If there is strong
>> >>> coupling between storm and these external projects so that they
>> >>> break when storm changes then we need to understand the coupling
>> >>> and decide if we want to reduce that coupling by stabilizing
>> >>> APIs, improving version numbering and release process, etc.; or
>> >>> if the functionality is something that should be offered as a
>> >>> base service in storm.
>> >>>
>> >>> I can see politically the value of giving these other projects a
>> >>> home in Apache, and making them sub-projects is the simplest
>> >>> route to that. I’d love to have storm on yarn inside Apache.  I
>> >>> just don’t want to go overboard with it.  There was a time when
>> >>> HBase was a “contrib” module under Hadoop along with a lot of
>> >>> other things, and the Apache board came and told Hadoop to break
>> >>> it up.
>> >>>
>> >>> Bringing storm-kafka into storm does not sound like it will solve
>> >>> much from a developer’s perspective, because there is at least as
>> >>> much coupling with kafka as there is with storm.  I can see how
>> >>> it is a huge amount of overhead and pain to set up a new project
>> >>> just for a few hundred lines of code, as such I am in favor of
>> >>> pulling in closely related projects, especially those that are
>> >>> spouts and state implementations. I just want to be sure that we
>> >>> do it carefully, with a good reason, and with enough people who
>> >>> are familiar with the code to support it long term.
>> >>>
>> >>> If it starts to look like we are pulling in too many projects
>> >>> perhaps we should look at something more like the bigtop project
>> >>> https://bigtop.apache.org/ which produces a tested distribution
>> >>> of Hadoop with many different sub-projects included in it.
>> >>>
>> >>> I am also a bit concerned about these sub-projects becoming
>> >>> second class citizens, where we break something, but because the
>> >>> build is off by default we don’t know it.  I would prefer that
>> >>> they are built and tested by default.  If the build and test time
>> >>> starts to take too long, to me that means we need to start
>> >>> wondering if we have too many contrib modules.
>> >>>
>> >>> —Bobby
>> >>>
>> >>> From: Brian Enochson <brian.enochson@gmail.com
>> >>> <ma...@gmail.com>>
>> > Reply-To: "user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>"
>> > <user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>>
>> > Date: Tuesday, February 25, 2014 at 9:50 PM
>> >>> To: "user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>"
>> > <user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>>
>> > Cc: "dev@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>"
>> > <dev@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>>
>> > Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> >>>
>> >>> hi, I am in agreement with Taylor and believe I understand his
>> >>> intent. An incredible tool/framework/application like Storm is
>> >>> only enhanced and gains value from the number of well maintained
>> >>> and vetted modules that can be used for integration and adding
>> >>> further functionality. I am relatively new to the Storm community
>> >>> but have spent quite some time reviewing contributing modules out
>> >>> there, reviewing various duplicates and running into some version
>> >>> incompatibilities. I understand the need to keep Storm itself
>> >>> pure, but do think there needs to be some structure and
>> >>> governance added to the contributing modules. Look at the benefit
>> >>> a tool like npm brings to the node community. I like the idea of
>> >>> sponsorship, vetting and a community vote.  I, as I am sure many would
>> >>> be, am willing to offer support and time to working through how
>> >>> to set this up and helping with the implementation if it is
>> >>> decided to pursue some solution. I hope these views are taken in
>> >>> the spirit they are made, to make this incredible system even
>> >>> better along with the surrounding eco-system.
>> >>>
>> >>> Thanks, Brian
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com
>> >>> <ma...@gmail.com>> wrote: Just
>> >>> to be clear (and play a little Devil’s advocate :) ), I’m not
>> >>> suggesting that whatever a “contrib” project/module/subproject
>> >>> might become, be a clearinghouse for anything Storm-related.
>> >>>
>> >>> I see it as something that is well-vetted by the Storm
>> >>> community, subject to PPMC review, vote, etc. Entry would require
>> >>> community review, PPMC review, and in some cases ASF IP
>> >>> clearance/legal review. Anything added would require some level
>> >>> of commitment from the PPMC/committers to provide some level of
>> >>> support.
>> >>>
>> >>> In other words, nothing “willy-nilly”.
>> >>>
>> >>> One option could be that any module added require (X > 0)  number
>> >>> of committers to volunteer as “sponsor”s for the module, and
>> >>> commit to maintaining it.
>> >>>
>> >>> That being said, I don’t see storm-kafka being any different
>> >>> from anything else that provides integration points for Storm.
>> >>>
>> >>> -Taylor
>> >>>
>> >>>
>> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com
>> >>> <ma...@nathanmarz.com>>
>> >>> wrote:
>> >>>
>> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> >>> projects put these contrib modules in a "contrib" folder and keep
>> >>> them managed as completely separate codebases. As it's not
>> >>> actually a "module" necessary for Storm, there's an argument
>> >>> there for doing it that way rather than via the multi-module
>> >>> route.
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>> >>> <mpathira@umail.iu.edu
>> >>> <ma...@umail.iu.edu>>
>> >>> wrote: Hi Taylor,
>> >>>
>> >>> I'm +1 for pulling these external libraries into Apache codebase.
>> >>> This will certainly benefit the Storm community. I would also like to
>> >>> contribute to this process.
>> >>>
>> >>> Thanks Milinda
>> >>>
>> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com
>> >>> <ma...@gmail.com>> wrote:
>> >>>> A while back I opened STORM-206 [1] to capture ideas for
>> >>>> pulling in "contrib" modules to the Apache codebase.
>> >>>>
>> >>>> In the past, we had the storm-contrib github project [2] which
>> >>>> subsequently got broken up into individual projects hosted on
>> >>>> the stormprocessor github group [3] and elsewhere.
>> >>>>
>> >>>> The problem with this approach is that in certain cases it led
>> >>>> to code rot (modules not being updated in step with Storm's
>> >>>> API), fragmentation (multiple similar modules with the same
>> >>>> name), and confusion.
>> >>>>
>> >>>> A good example of this is the storm-kafka module [4], since it
>> >>>> is a widely used component. Because storm-contrib wasn't being
>> >>>> tagged in github, a lot of users had trouble reconciling with
>> >>>> which versions of storm it was compatible. Some users built off
>> >>>> specific commit hashes, some forked, and a few even pushed
>> >>>> custom builds to repositories such as clojars. With kafka 0.8
>> >>>> now available, there are two main storm-kafka projects, the
>> >>>> original (compatible with kafka 0.7) and an updated fork [5]
>> >>>> (compatible with kafka 0.8).
>> >>>>
>> >>>> My intention is not to find fault in any way, but rather to
>> >>>> point out the resulting pain, and work toward a better
>> >>>> solution.
>> >>>>
>> >>>> I think it would be beneficial to the Storm user community to
>> >>>> have certain commonly used modules like storm-kafka brought
>> >>>> into the Apache Storm project. Another benefit worth
>> >>>> considering is the licensing/legal oversight that the ASF
>> >>>> provides, which is important to many users.
>> >>>>
>> >>>> If this is something we want to do, then the big question
>> >>>> becomes what sort of governance process needs to be established to
>> >>>> ensure that such things are properly maintained.
>> >>>>
>> >>>> Some random thoughts, questions, etc. that jump to mind
>> >>>> include:
>> >>>>
>> >>>> What to call these things: "contrib modules", "connectors",
>> >>>> "integration modules", etc.?
>> >>>>
>> >>>> Build integration: I imagine they would be a multi-module
>> >>>> submodule of the main maven build. Probably turned off by default
>> >>>> and enabled by a maven profile.
>> >>>>
>> >>>> Governance: Have one or more committer volunteers responsible
>> >>>> for maintenance, merging patches, etc.?
>> >>>>
>> >>>> Proposal process for pulling new modules?
>> >>>>
>> >>>>
>> >>>> I look forward to hearing others' opinions.
>> >>>>
>> >>>> - Taylor
>> >>>>
>> >>>>
>> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
>> >>>> [2] https://github.com/nathanmarz/storm-contrib
>> >>>> [3] https://github.com/stormprocessor
>> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>> >
>> 
>> 
>> 
>> -- 
>> Twitter: @nathanmarz
>> http://nathanmarz.com
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

To clarify somewhat, the pull request for pulling in storm-starter [1] puts it in an “examples” directory. And there are suggestions to pull James’ scheduler and testing examples in there as well. So there is a distinction between examples and other things like storm-kafka.

What I’m proposing is a different yet-to-be-named directory that would be home to things that integrate storm with other technologies.

In the storm-contrib README [2] the term used is “modules”. On the Storm website, we also use the term “adapter” [3].

- Taylor

[1] https://github.com/apache/incubator-storm/pull/44
[2] https://github.com/nathanmarz/storm-contrib/blob/master/README.md#about
[3] http://storm.incubator.apache.org/documentation/Spout-implementations.html


On Mar 13, 2014, at 1:24 AM, David Miller <da...@m-square.com.au> wrote:

> 
> what about both ?
> connectors for spout/bolt/states that connect to other tech, storm-kafka, storm-cassandra, etc
> extras for other things like storm-starter, storm-deploy, storm-puppet
> 
> 
> 
> On 13 Mar 2014, at 3:57 pm, Nathan Marz <na...@nathanmarz.com> wrote:
> 
>> I don't like either name tbh. Storm itself is already broken into modules (storm-core, storm-netty, etc) and things like storm-starter and storm-kafka are something different. I don't like "connectors" because something like storm-starter is not a connector. Maybe we call them "extras"?
>> 
>> I would say just to support 0.8.x of Kafka.
>> 
>> 
>> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>> Incorporation of storm starter is underway.
>> 
>> I'd like to turn the attention to kafka, with the goal being to pull in kafka support that is maintained and will be known to be compatible with the current version of storm and specific version(s) of kafka.
>> 
>> I have the following questions for the community:
>> 
>> 1. What do we want to call additions like this? I'm leaning toward "modules" or "connectors".
>> 
>> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just 0.8.x? From a release management perspective, the latter is preferable because the 0.7.x line artifacts are not in maven central. This makes building a real pain, and maintaining support for two versions won't be fun. Also, most of the people I have worked with are looking at 0.8.x for a variety of reasons, but I'm open to either way.
>> 
>> - Taylor
>> 
>> 
>> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <mi...@michael-noll.com> wrote:
>> >
>> > Thanks for starting this discussion, Taylor.
>> >
>> > As a user of Storm (and a small-scale contributor to storm-starter) as
>> > well as a user of Kafka, here are my $.02.
>> >
>> > [Storm and Kafka]
>> > First, I agree with Nathan that storm-kafka should be considered to be
>> > brought in.  While various "integrate Storm with X" options exist,
>> > basically everyone I have been talking to is using Kafka in
>> > combination with Storm.  I'm sure this is not a representative sample
>> > of Storm users, and of course one may or may not agree that Kafka is
>> > important enough of a technology in Storm's ecosystem.  Still, I do
>> > see the need to make sure Storm and Kafka do work together without
>> > having to go through forks of forks on GitHub and spending days to
>> > figure out how to get data from Kafka (0.8) into Storm.
>> >    Speaking of Kafka spout implementations, please don't forget
>> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
>> > We've been quite happy with the former, so I'd suggest to at least
>> > consider both options here (maybe the two projects can even join forces?).
>> >
>> > [Storm examples, storm-starter]
>> > Second, IMHO every open source project should have a "1-click starting
>> > experience" for new users.  That's very much related to the project
>> > principles of tools like LogStash [1] who say: "Community: If a newbie
>> > has a bad time, it's a bug."  For this reason I personally would like
>> > to see the equivalent of storm-starter being brought into the "core"
>> > Storm project -- think of an examples/ sub-module.  If the level of
>> > effort is deemed too high to e.g. maintain what's already in
>> > storm-starter, then (say) reduce the scope and remove some of the
>> > examples.  In any case I personally would like to see bundled
>> > examples that are known to work with the latest version of Storm.
>> > storm-starter is often used to show new users how to get started with
>> > Storm (I used that approach in my Storm blog posts, for instance, and
>> > others like Mesosphere.io are even using storm-starter for their
>> > commercial offerings [2]).
>> >
>> > [Have Storm up and running faster than you can brew an espresso]
>> > Third, for the same reason (get people up and running in a few
>> > minutes), I do like that other people in this thread have been
>> > bringing up projects like storm-deploy.  For the same reason I have
>> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
>> > few days ago, and I'll soon open source another Vagrant/Puppet based
>> > tool that provides you with 1-click local and remote deployments of
>> > Storm and Kafka clusters.  That's way better IMHO than having to
>> > follow long articles or blog posts to deploy your first cluster.  And
>> > there are a number of other people that have been rolling their own
>> > variants.  Now don't get me wrong -- I don't mention this to pitch any
>> > of those tools.  My intention is to say that it would be greatly
>> > helpful to have /something/ like this for Storm, for the same reason
>> > that it's nice to have LocalCluster for unit testing.  I have been
>> > demo'ing both Storm and Kafka by launching clusters with a simple
>> > command line, which always gets people excited.  If they can then rely
>> > on existing examples (see above) to also /run/ an analysis on "their"
>> > cluster then they have a beautiful start.
>> >    Oh, and btw:  Apache Aurora (with Mesos) have such a Vagrant-based
>> > VM cluster setup, too [4] so that people can run the Aurora tutorial
>> > on their machines in a few minutes.
>> >
>> > [Storm and YARN]
>> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
>> > would be nice.  It ties into being able to run LocalCluster as well as
>> > to run Storm in local or remote VMs -- but now alongside your existing
>> > Hadoop/YARN infrastructure.  For those preferring Mesos Storm-on-Mesos
>> > will surely be similarly attractive.
>> >
>> >
>> > On a related note bringing the Storm docs up to speed with the quality
>> > of the Storm code would also be great.  I have seen that since Storm
>> > moved to Incubator several new sections have been added such as the
>> > FAQ [5] (btw: nice!).
>> >
>> > Similarly, there should be better examples and docs for users how to
>> > write unit tests for Storm.  Right now people seem to be cobbling
>> > together their test code by figuring out how the 1-year old code in
>> > [6] actually works, and copy-pasting other people's test code from GitHub.
>> >
>> > --
>> >
>> > As I said above, these are my personal $.02.  I admit that my comments
>> > go a bit beyond the original question of bringing in contrib modules
>> > -- I think implicitly the discussion about the contrib modules also
>> > means "what do you need to provide a better and more well-rounded
>> > experience", i.e. the question whether to have batteries included or
>> > not. (As you may suspect I'm leaning towards including at least the
>> > most important batteries, though what's really "important" at the
>> > project level is of course up to debate.)
>> >
>> > On my side I'd be happy to help with those areas where I am able to
>> > contribute, whether that's code and examples (like storm-starter) or
>> > tutorials/docs (I already wrote e.g. [7] and [8]).
>> >
>> > Again, thanks Taylor for starting this discussion.  No matter the
>> > actual outcome I'm sure the state of the project will be improved.
>> >
>> > Best,
>> > Michael
>> >
>> >
>> >
>> > [1] https://github.com/elasticsearch/logstash
>> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
>> > [3] https://github.com/miguno/puppet-storm
>> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
>> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
>> > [6]
>> > https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
>> > [7]
>> > https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
>> > [8]
>> > http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
>> >
>> >
>> >
>> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> >> Thanks for the feedback Bobby.
>> >>
>> >> To clarify, I’m mainly talking about spout/bolt/trident state
>> >> implementations that integrate storm with *Technology X*, where
>> >> *Technology X* is not a fundamental part of storm.
>> >>
>> >> Examples would be technologies that are part of or related to the
>> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>> >> Kafka, HDFS, HBase, Cassandra, etc.
>> >>
>> >> The idea behind having one or more Storm committers act as a
>> >> “sponsor” is to make sure new additions are done carefully and with
>> >> good reason. To add a new module, it would require committer/PPMC
>> >> consensus, and assignment of one or more sponsors. Part of a
>> >> sponsor’s job would be to ensure that a module is maintained, which
>> >> would require enough familiarity with the code to support it long
>> >> term. If a new module was proposed, but no committers were willing
>> >> to act as a sponsor, it would not be added.
>> >>
>> >> It would be the Committers’/PPMC’s responsibility to make sure things
>> >> didn’t get out of hand, and to do something about it if they did.
>> >>
>> >> Here’s an old Hadoop JIRA thread [1] discussing the addition of
>> >> Hive as a contrib module, similar to what happened with HBase as
>> >> Bobby pointed out. Some interesting points are brought up. The
>> >> difference here is that both HBase and Hive were pretty big
>> >> codebases relative to Hadoop. With spout/bolt/state implementations
>> >> I doubt we’d see anything along that scale.
>> >>
>> >> - Taylor
>> >>
>> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> >>
>> >>
>> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com
>> >> <ma...@yahoo-inc.com>> wrote:
>> >>
>> >>> I can see a lot of value in having a distribution of storm that
>> >>> comes with batteries included, everything is tested together and
>> >>> you know it works.  But I don’t see much long term developer
>> >>> benefit in building them all together.  If there is strong
>> >>> coupling between storm and these external projects so that they
>> >>> break when storm changes then we need to understand the coupling
>> >>> and decide if we want to reduce that coupling by stabilizing
>> >>> APIs, improving version numbering and release process, etc.; or
>> >>> if the functionality is something that should be offered as a
>> >>> base service in storm.
>> >>>
>> >>> I can see politically the value of giving these other projects a
>> >>> home in Apache, and making them sub-projects is the simplest
>> >>> route to that. I’d love to have storm on yarn inside Apache.  I
>> >>> just don’t want to go overboard with it.  There was a time when
>> >>> HBase was a “contrib” module under Hadoop along with a lot of
>> >>> other things, and the Apache board came and told Hadoop to break
>> >>> it up.
>> >>>
>> >>> Bringing storm-kafka into storm does not sound like it will solve
>> >>> much from a developer’s perspective, because there is at least as
>> >>> much coupling with kafka as there is with storm.  I can see how
>> >>> it is a huge amount of overhead and pain to set up a new project
>> >>> just for a few hundred lines of code, as such I am in favor of
>> >>> pulling in closely related projects, especially those that are
>> >>> spouts and state implementations. I just want to be sure that we
>> >>> do it carefully, with a good reason, and with enough people who
>> >>> are familiar with the code to support it long term.
>> >>>
>> >>> If it starts to look like we are pulling in too many projects
>> >>> perhaps we should look at something more like the bigtop project
>> >>> https://bigtop.apache.org/ which produces a tested distribution
>> >>> of Hadoop with many different sub-projects included in it.
>> >>>
>> >>> I am also a bit concerned about these sub-projects becoming
>> >>> second class citizens, where we break something, but because the
>> >>> build is off by default we don’t know it.  I would prefer that
>> >>> they are built and tested by default.  If the build and test time
>> >>> starts to take too long, to me that means we need to start
>> >>> wondering if we have too many contrib modules.
>> >>>
>> >>> —Bobby
>> >>>
>> >>> From: Brian Enochson <brian.enochson@gmail.com
>> >>> <ma...@gmail.com>>
>> > Reply-To: "user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>"
>> > <user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>>
>> > Date: Tuesday, February 25, 2014 at 9:50 PM
>> >>> To: "user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>"
>> > <user@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>>
>> > Cc: "dev@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>"
>> > <dev@storm.incubator.apache.org
>> >>> <ma...@storm.incubator.apache.org>>
>> > Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> >>>
>> >>> hi, I am in agreement with Taylor and believe I understand his
>> >>> intent. An incredible tool/framework/application like Storm is
>> >>> only enhanced and gains value from the number of well maintained
>> >>> and vetted modules that can be used for integration and adding
>> >>> further functionality. I am relatively new to the Storm community
>> >>> but have spent quite some time reviewing contributing modules out
>> >>> there, reviewing various duplicates and running into some version
>> >>> incompatibilities. I understand the need to keep Storm itself
>> >>> pure, but do think there needs to be some structure and
>> >>> governance added to the contributing modules. Look at the benefit
>> >>> a tool like npm brings to the node community. I like the idea of
>> >>> sponsorship, vetting and a community vote.  I, as I am sure many would
>> >>> be, am willing to offer support and time to working through how
>> >>> to set this up and helping with the implementation if it is
>> >>> decided to pursue some solution. I hope these views are taken in
>> >>> the spirit they are made, to make this incredible system even
>> >>> better along with the surrounding eco-system.
>> >>>
>> >>> Thanks, Brian
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com
>> >>> <ma...@gmail.com>> wrote: Just
>> >>> to be clear (and play a little Devil’s advocate :) ), I’m not
>> >>> suggesting that whatever a “contrib” project/module/subproject
>> >>> might become, be a clearinghouse for anything Storm-related.
>> >>>
>> >>> I see it as something that is well-vetted by the Storm
>> >>> community, subject to PPMC review, vote, etc. Entry would require
>> >>> community review, PPMC review, and in some cases ASF IP
>> >>> clearance/legal review. Anything added would require some level
>> >>> of commitment from the PPMC/committers to provide some level of
>> >>> support.
>> >>>
>> >>> In other words, nothing “willy-nilly”.
>> >>>
>> >>> One option could be that any module added require (X > 0)  number
>> >>> of committers to volunteer as “sponsor”s for the module, and
>> >>> commit to maintaining it.
>> >>>
>> >>> That being said, I don’t see storm-kafka being any different
>> >>> from anything else that provides integration points for Storm.
>> >>>
>> >>> -Taylor
>> >>>
>> >>>
>> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com
>> >>> <ma...@nathanmarz.com>>
>> >>> wrote:
>> >>>
>> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> >>> projects put these contrib modules in a "contrib" folder and keep
>> >>> them managed as completely separate codebases. As it's not
>> >>> actually a "module" necessary for Storm, there's an argument
>> >>> there for doing it that way rather than via the multi-module
>> >>> route.
>> >>>
>> >>>
>> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>> >>> <mpathira@umail.iu.edu
>> >>> <ma...@umail.iu.edu>>
>> >>> wrote: Hi Taylor,
>> >>>
>> >>> I'm +1 for pulling these external libraries into Apache codebase.
>> >>> This will certainly benefit the Storm community. I would also like to
>> >>> contribute to this process.
>> >>>
>> >>> Thanks Milinda
>> >>>
>> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>> >>> <ptgoetz@gmail.com
>> >>> <ma...@gmail.com>> wrote:
>> >>>> A while back I opened STORM-206 [1] to capture ideas for
>> >>>> pulling in "contrib" modules to the Apache codebase.
>> >>>>
>> >>>> In the past, we had the storm-contrib github project [2] which
>> >>>> subsequently got broken up into individual projects hosted on
>> >>>> the stormprocessor github group [3] and elsewhere.
>> >>>>
>> >>>> The problem with this approach is that in certain cases it led
>> >>>> to code rot (modules not being updated in step with Storm's
>> >>>> API), fragmentation (multiple similar modules with the same
>> >>>> name), and confusion.
>> >>>>
>> >>>> A good example of this is the storm-kafka module [4], since it
>> >>>> is a widely used component. Because storm-contrib wasn't being
>> >>>> tagged in github, a lot of users had trouble reconciling with
>> >>>> which versions of storm it was compatible. Some users built off
>> >>>> specific commit hashes, some forked, and a few even pushed
>> >>>> custom builds to repositories such as clojars. With kafka 0.8
>> >>>> now available, there are two main storm-kafka projects, the
>> >>>> original (compatible with kafka 0.7) and an updated fork [5]
>> >>>> (compatible with kafka 0.8).
>> >>>>
>> >>>> My intention is not to find fault in any way, but rather to
>> >>>> point out the resulting pain, and work toward a better
>> >>>> solution.
>> >>>>
>> >>>> I think it would be beneficial to the Storm user community to
>> >>>> have certain commonly used modules like storm-kafka brought
>> >>>> into the Apache Storm project. Another benefit worth
>> >>>> considering is the licensing/legal oversight that the ASF
>> >>>> provides, which is important to many users.
>> >>>>
>> >>>> If this is something we want to do, then the big question
>> >>>> becomes what sort of governance process needs to be established to
>> >>>> ensure that such things are properly maintained.
>> >>>>
>> >>>> Some random thoughts, questions, etc. that jump to mind
>> >>>> include:
>> >>>>
>> >>>> What to call these things: "contrib modules", "connectors",
>> >>>> "integration modules", etc.?
>> >>>>
>> >>>> Build integration: I imagine they would be a multi-module
>> >>>> submodule of the main maven build. Probably turned off by default
>> >>>> and enabled by a maven profile.
>> >>>>
>> >>>> Governance: Have one or more committer volunteers responsible
>> >>>> for maintenance, merging patches, etc.? Proposal process for
>> >>>> pulling new modules?
>> >>>>
>> >>>>
>> >>>> I look forward to hearing others' opinions.
>> >>>>
>> >>>> - Taylor
>> >>>>
>> >>>>
>> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
>> >>>> [2] https://github.com/nathanmarz/storm-contrib
>> >>>> [3] https://github.com/stormprocessor
>> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>> >
>> 
>> 
>> 
>> -- 
>> Twitter: @nathanmarz
>> http://nathanmarz.com
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by David Miller <da...@m-square.com.au>.

what about both ?
connectors for spout/bolt/states that connect to other tech, storm-kafka, storm-cassandra, etc
extras for other things like storm-starter, storm-deploy, storm-puppet



On 13 Mar 2014, at 3:57 pm, Nathan Marz <na...@nathanmarz.com> wrote:

> I don't like either name tbh. Storm itself is already broken into modules (storm-core, storm-netty, etc) and things like storm-starter and storm-kafka are something different. I don't like "connectors" because something like storm-starter is not a connector. Maybe we call them "extras"?
> 
> I would say just to support 0.8.x of Kafka.
> 
> 
> On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> Incorporation of storm starter is underway.
> 
> I'd like to turn the attention to kafka, with the goal being to pull in kafka support that is maintained and will be known to be compatible with the current version of storm and specific version(s) of kafka.
> 
> I have the following questions for the community:
> 
> 1. What do we want to call additions like this? I'm leaning toward "modules" or "connectors".
> 
> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just 0.8.x? From a release management perspective, the latter is preferable because the 0.7.x line artifacts are not in maven central. This makes building a real pain, and maintaining support for two versions won't be fun. Also, most of the people I have worked with are looking at 0.8.x for a variety of reasons, but I'm open to either way.
> 
> - Taylor
> 
> 
> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <mi...@michael-noll.com> wrote:
> >
> > Thanks for starting this discussion, Taylor.
> >
> > As a user of Storm (and a small-scale contributor to storm-starter) as
> > well as a user of Kafka, here are my $.02.
> >
> > [Storm and Kafka]
> > First, I agree with Nathan that storm-kafka should be considered to be
> > brought in.  While various "integrate Storm with X" options exist,
> > basically everyone I have been talking to is using Kafka in
> > combination with Storm.  I'm sure this is not a representative sample
> > of Storm users, and of course one may or may not agree that Kafka is
> > important enough of a technology in Storm's ecosystem.  Still, I do
> > see the need to make sure Storm and Kafka do work together without
> > having to go through forks of forks on GitHub and spending days to
> > figure out how to get data from Kafka (0.8) into Storm.
> >    Speaking of Kafka spout implementations, please don't forget
> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> > We've been quite happy with the former, so I'd suggest to at least
> > consider both options here (maybe the two projects can even join forces?).
> >
> > [Storm examples, storm-starter]
> > Second, IMHO every open source project should have a "1-click starting
> > experience" for new users.  That's very much related to the project
> > principles of tools like LogStash [1] who say: "Community: If a newbie
> > has a bad time, it's a bug."  For this reason I personally would like
> > to see the equivalent of storm-starter being brought into the "core"
> > Storm project -- think of an examples/ sub-module.  If the level of
> > effort is deemed too high to e.g. maintain what's already in
> > storm-starter, then (say) reduce the scope and remove some of the
> > examples.  In any case I'd personally like to see bundled
> > examples that are known to work with the latest version of Storm.
> > storm-starter is often used to show new users how to get started with
> > Storm (I used that approach in my Storm blog posts, for instance, and
> > others like Mesosphere.io are even using storm-starter for their
> > commercial offerings [2]).
> >
> > [Have Storm up and running faster than you can brew an espresso]
> > Third, for the same reason (get people up and running in a few
> > minutes), I do like that other people in this thread have been
> > bringing up projects like storm-deploy.  For the same reason I have
> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> > few days ago, and I'll soon open source another Vagrant/Puppet based
> > tool that provides you with 1-click local and remote deployments of
> > Storm and Kafka clusters.  That's way better IMHO than having to
> > follow long articles or blog posts to deploy your first cluster.  And
> > there are a number of other people that have been rolling their own
> > variants.  Now don't get me wrong -- I don't mention this to pitch any
> > of those tools.  My intention is to say that it would be greatly
> > helpful to have /something/ like this for Storm, for the same reason
> > that it's nice to have LocalCluster for unit testing.  I have been
> > demo'ing both Storm and Kafka by launching clusters with a simple
> > command line, which always gets people excited.  If they can then rely
> > on existing examples (see above) to also /run/ an analysis on "their"
> > cluster then they have a beautiful start.
> >    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
> > VM cluster setup, too [4] so that people can run the Aurora tutorial
> > on their machines in a few minutes.
> >
> > [Storm and YARN]
> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> > would be nice.  It ties into being able to run LocalCluster as well as
> > to run Storm in local or remote VMs -- but now alongside your existing
> > Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
> > will surely be similarly attractive.
> >
> >
> > On a related note bringing the Storm docs up to speed with the quality
> > of the Storm code would also be great.  I have seen that since Storm
> > moved to Incubator several new sections have been added such as the
> > FAQ [5] (btw: nice!).
> >
> > Similarly, there should be better examples and docs for users how to
> > write unit tests for Storm.  Right now people seem to be cobbling
> > together their test code by figuring out how the 1-year old code in
> > [6] actually works, and copy-pasting other people's test code from GitHub.
> >
> > --
> >
> > As I said above, these are my personal $.02.  I admit that my comments
> > go a bit beyond the original question of bringing in contrib modules
> > -- I think implicitly the discussion about the contrib modules also
> > means "what do you need to provide a better and more well-rounded
> > experience", i.e. the question whether to have batteries included or
> > not. (As you may suspect I'm leaning towards including at least the
> > most important batteries, though what's really "important" on the
> > project level is of course up for debate.)
> >
> > On my side I'd be happy to help with those areas where I am able to
> > contribute, whether that's code and examples (like storm-starter) or
> > tutorials/docs (I already wrote e.g. [7] and [8]).
> >
> > Again, thanks Taylor for starting this discussion.  No matter the
> > actual outcome I'm sure the state of the project will be improved.
> >
> > Best,
> > Michael
> >
> >
> >
> > [1] https://github.com/elasticsearch/logstash
> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> > [3] https://github.com/miguno/puppet-storm
> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
> > [6]
> > https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> > [7]
> > https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> > [8]
> > http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> >
> >
> >
> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> >> Thanks for the feedback Bobby.
> >>
> >> To clarify, I’m mainly talking about spout/bolt/trident state
> >> implementations that integrate storm with *Technology X*, where
> >> *Technology X* is not a fundamental part of storm.
> >>
> >> Examples would be technologies that are part of or related to the
> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> >> Kafka, HDFS, HBase, Cassandra, etc.
> >>
> >> The idea behind having one or more Storm committers act as a
> >> “sponsor” is to make sure new additions are done carefully and with
> >> good reason. To add a new module, it would require committer/PPMC
> >> consensus, and assignment of one or more sponsors. Part of a
> >> sponsor’s job would be to ensure that a module is maintained, which
> >> would require enough familiarity with the code to support it long
> >> term. If a new module was proposed, but no committers were willing
> >> to act as a sponsor, it would not be added.
> >>
> >> It would be the Committers’/PPMC’s responsibility to make sure things
> >> don’t get out of hand, and to do something about it if they do.
> >>
> >> Here’s an old Hadoop JIRA thread [1] discussing the addition of
> >> Hive as a contrib module, similar to what happened with HBase as
> >> Bobby pointed out. Some interesting points are brought up. The
> >> difference here is that both HBase and Hive were pretty big
> >> codebases relative to Hadoop. With spout/bolt/state implementations
> >> I doubt we’d see anything along that scale.
> >>
> >> - Taylor
> >>
> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> >>
> >>
> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com
> >> <ma...@yahoo-inc.com>> wrote:
> >>
> >>> I can see a lot of value in having a distribution of storm that
> >>> comes with batteries included, everything is tested together and
> >>> you know it works.  But I don’t see much long term developer
> >>> benefit in building them all together.  If there is strong
> >>> coupling between storm and these external projects so that they
> >>> break when storm changes then we need to understand the coupling
> >>> and decide if we want to reduce that coupling by stabilizing
> >>> APIs, improving version numbering and release process, etc.; or
> >>> if the functionality is something that should be offered as a
> >>> base service in storm.
> >>>
> >>> I can see politically the value of giving these other projects a
> >>> home in Apache, and making them sub-projects is the simplest
> >>> route to that. I’d love to have storm on yarn inside Apache.  I
> >>> just don’t want to go overboard with it.  There was a time when
> >>> HBase was a “contrib” module under Hadoop along with a lot of
> >>> other things, and the Apache board came and told Hadoop to break
> >>> it up.
> >>>
> >>> Bringing storm-kafka into storm does not sound like it will solve
> >>> much from a developer’s perspective, because there is at least as
> >>> much coupling with kafka as there is with storm.  I can see how
> >>> it is a huge amount of overhead and pain to set up a new project
> >>> just for a few hundred lines of code, as such I am in favor of
> >>> pulling in closely related projects, especially those that are
> >>> spouts and state implementations. I just want to be sure that we
> >>> do it carefully, with a good reason, and with enough people who
> >>> are familiar with the code to support it long term.
> >>>
> >>> If it starts to look like we are pulling in too many projects
> >>> perhaps we should look at something more like the bigtop project
> >>> https://bigtop.apache.org/ which produces a tested distribution
> >>> of Hadoop with many different sub-projects included in it.
> >>>
> >>> I am also a bit concerned about these sub-projects becoming
> >>> second class citizens, where we break something, but because the
> >>> build is off by default we don’t know it.  I would prefer that
> >>> they are built and tested by default.  If the build and test time
> >>> starts to take too long, to me that means we need to start
> >>> wondering if we have too many contrib modules.
> >>>
> >>> —Bobby
> >>>
> >>> From: Brian Enochson <brian.enochson@gmail.com>
> >>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
> >>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> >>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> >>>
> >>> hi, I am in agreement with Taylor and believe I understand his
> >>> intent. An incredible tool/framework/application like Storm is
> >>> only enhanced and gains value from the number of well maintained
> >>> and vetted modules that can be used for integration and adding
> >>> further functionality. I am relatively new to the Storm community
> >>> but have spent quite some time reviewing contributing modules out
> >>> there, reviewing various duplicates and running into some version
> >>> incompatibilities. I understand the need to keep Storm itself
> >>> pure, but do think there needs to be some structure and
> >>> governance added to the contributing modules. Look at the benefit
> >>> a tool like npm brings to the node community. I like the idea of
> >>> sponsorship, vetting and a community vote.  I, as I'm sure many would
> >>> be, am willing to offer support and time to working through how
> >>> to set this up and helping with the implementation if it is
> >>> decided to pursue some solution. I hope these views are taken in
> >>> the spirit in which they are made, to make this incredible system even
> >>> better along with the surrounding eco-system.
> >>>
> >>> Thanks, Brian
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
> >>> <ptgoetz@gmail.com> wrote:
> >>>
> >>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
> >>> suggesting that whatever a “contrib” project/module/subproject
> >>> might become, be a clearinghouse for anything Storm-related.
> >>>
> >>> I see it as something that is well-vetted by the Storm
> >>> community, subject to PPMC review, vote, etc. Entry would require
> >>> community review, PPMC review, and in some cases ASF IP
> >>> clearance/legal review. Anything added would require some level
> >>> of commitment from the PPMC/committers to provide some level of
> >>> support.
> >>>
> >>> In other words, nothing “willy-nilly”.
> >>>
> >>> One option could be that any module added require (X > 0)  number
> >>> of committers to volunteer as “sponsor”s for the module, and
> >>> commit to maintaining it.
> >>>
> >>> That being said, I don’t see storm-kafka being any different
> >>> from anything else that provides integration points for Storm.
> >>>
> >>> -Taylor
> >>>
> >>>
> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com
> >>> <ma...@nathanmarz.com>>
> >>> wrote:
> >>>
> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
> >>> projects put these contrib modules in a "contrib" folder and keep
> >>> them managed as completely separate codebases. As it's not
> >>> actually a "module" necessary for Storm, there's an argument
> >>> there for doing it that way rather than via the multi-module
> >>> route.
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
> >>> <mpathira@umail.iu.edu> wrote:
> >>>
> >>> Hi Taylor,
> >>>
> >>> I'm +1 for pulling these external libraries into the Apache codebase.
> >>> This will certainly benefit the Storm community. I would also like to
> >>> contribute to this process.
> >>>
> >>> Thanks,
> >>> Milinda
> >>>
> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
> >>> <ptgoetz@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>> A while back I opened STORM-206 [1] to capture ideas for
> >>>> pulling in "contrib" modules to the Apache codebase.
> >>>>
> >>>> In the past, we had the storm-contrib github project [2] which
> >>>> subsequently got broken up into individual projects hosted on
> >>>> the stormprocessor github group [3] and elsewhere.
> >>>>
> >>>> The problem with this approach is that in certain cases it led
> >>>> to code rot (modules not being updated in step with Storm's
> >>>> API), fragmentation (multiple similar modules with the same
> >>>> name), and confusion.
> >>>>
> >>>> A good example of this is the storm-kafka module [4], since it
> >>>> is a widely used component. Because storm-contrib wasn't being
> >>>> tagged in github, a lot of users had trouble reconciling with
> >>>> which versions of storm it was compatible. Some users built off
> >>>> specific commit hashes, some forked, and a few even pushed
> >>>> custom builds to repositories such as clojars. With kafka 0.8
> >>>> now available, there are two main storm-kafka projects, the
> >>>> original (compatible with kafka 0.7) and an updated fork [5]
> >>>> (compatible with kafka 0.8).
> >>>>
> >>>> My intention is not to find fault in any way, but rather to
> >>>> point out the resulting pain, and work toward a better
> >>>> solution.
> >>>>
> >>>> I think it would be beneficial to the Storm user community to
> >>>> have certain commonly used modules like storm-kafka brought
> >>>> into the Apache Storm project. Another benefit worth
> >>>> considering is the licensing/legal oversight that the ASF
> >>>> provides, which is important to many users.
> >>>>
> >>>> If this is something we want to do, then the big question
> >>>> becomes what sort of governance process needs to be established to
> >>>> ensure that such things are properly maintained.
> >>>>
> >>>> Some random thoughts, questions, etc. that jump to mind
> >>>> include:
> >>>>
> >>>> What to call these things: "contrib modules", "connectors",
> >>>> "integration modules", etc.?
> >>>>
> >>>> Build integration: I imagine they would be a multi-module
> >>>> submodule of the main maven build. Probably turned off by default
> >>>> and enabled by a maven profile.
> >>>>
> >>>> Governance: Have one or more committer volunteers responsible for
> >>>> maintenance, merging patches, etc.?
> >>>>
> >>>> Proposal process for pulling new modules?
> >>>>
> >>>>
> >>>> I look forward to hearing others' opinions.
> >>>>
> >>>> - Taylor
> >>>>
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
> >>>> [2] https://github.com/nathanmarz/storm-contrib
> >>>> [3] https://github.com/stormprocessor
> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> >
> 
> 
> 
> -- 
> Twitter: @nathanmarz
> http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Nathan Marz <na...@nathanmarz.com>.

I don't like either name tbh. Storm itself is already broken into modules
(storm-core, storm-netty, etc) and things like storm-starter and
storm-kafka are something different. I don't like "connectors" because
something like storm-starter is not a connector. Maybe we call them
"extras"?

I would say we should support just Kafka 0.8.x.


On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> Incorporation of storm starter is underway.
>
> I'd like to turn the attention to kafka, with the goal being to pull in
> kafka support that is maintained and will be known to be compatible with
> the current version of storm and specific version(s) of kafka.
>
> I have the following questions for the community:
>
> 1. What do we want to call additions like this? I'm leaning toward
> "modules" or "connectors".
>
> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
> 0.8.x? From a release management perspective, the latter is preferable
> because the 0.7.x line artifacts are not in maven central. This makes
> building a real pain, and maintaining support for two versions won't be
> fun. Also, most of the people I have worked with are looking at 0.8.x for a
> variety of reasons, but I'm open to either way.
>
> - Taylor
>
>
> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <michael+storm@michael-noll.com> wrote:
> >
> > Thanks for starting this discussion, Taylor.
> >
> > As a user of Storm (and a small-scale contributor to storm-starter) as
> > well as a user of Kafka, here are my $.02.
> >
> > [Storm and Kafka]
> > First, I agree with Nathan that storm-kafka should be considered to be
> > brought in.  While various "integrate Storm with X" options exist,
> > basically everyone I have been talking to is using Kafka in
> > combination with Storm.  I'm sure this is not a representative sample
> > of Storm users, and of course one may or may not agree that Kafka is
> > important enough of a technology in Storm's ecosystem.  Still, I do
> > see the need to make sure Storm and Kafka do work together without
> > having to go through forks of forks on GitHub and spending days to
> > figure out how to get data from Kafka (0.8) into Storm.
> >    Speaking of Kafka spout implementations, please don't forget
> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> > We've been quite happy with the former, so I'd suggest at least
> > considering both options here (maybe the two projects can even join
> > forces?).
> >
> > [Storm examples, storm-starter]
> > Second, IMHO every open source project should have a "1-click starting
> > experience" for new users.  That's very much related to the project
> > principles of tools like LogStash [1] who say: "Community: If a newbie
> > has a bad time, it's a bug."  For this reason I personally would like
> > to see the equivalent of storm-starter being brought into the "core"
> > Storm project -- think of an examples/ sub-module.  If the level of
> > effort is deemed too high to e.g. maintain what's already in
> > storm-starter, then (say) reduce the scope and remove some of the
> > examples.  In any case I personally would like to see bundled
> > examples that are known to work with the latest version of Storm.
> > storm-starter is often used to show new users how to get started with
> > Storm (I used that approach in my Storm blog posts, for instance, and
> > others like Mesosphere.io are even using storm-starter for their
> > commercial offerings [2]).
> >
> > [Have Storm up and running faster than you can brew an espresso]
> > Third, for the same reason (get people up and running in a few
> > minutes), I do like that other people in this thread have been
> > bringing up projects like storm-deploy.  For the same reason I have
> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> > few days ago, and I'll soon open source another Vagrant/Puppet based
> > tool that provides you with 1-click local and remote deployments of
> > Storm and Kafka clusters.  That's way better IMHO than having to
> > follow long articles or blog posts to deploy your first cluster.  And
> > there are a number of other people that have been rolling their own
> > variants.  Now don't get me wrong -- I don't mention this to pitch any
> > of those tools.  My intention is to say that it would be greatly
> > helpful to have /something/ like this for Storm, for the same reason
> > that it's nice to have LocalCluster for unit testing.  I have been
> > demo'ing both Storm and Kafka by launching clusters with a simple
> > command line, which always gets people excited.  If they can then rely
> > on existing examples (see above) to also /run/ an analysis on "their"
> > cluster then they have a beautiful start.
> >    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
> > VM cluster setup, too [4] so that people can run the Aurora tutorial
> > on their machines in a few minutes.
> >
> > [Storm and YARN]
> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> > would be nice.  It ties into being able to run LocalCluster as well as
> > to run Storm in local or remote VMs -- but now alongside your existing
> > Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
> > will surely be similarly attractive.
> >
> >
> > On a related note bringing the Storm docs up to speed with the quality
> > of the Storm code would also be great.  I have seen that since Storm
> > moved to Incubator several new sections have been added such as the
> > FAQ [5] (btw: nice!).
> >
> > Similarly, there should be better examples and docs showing users how
> > to write unit tests for Storm.  Right now people seem to be cobbling
> > together their test code by figuring out how the 1-year old code in
> > [6] actually works, and copy-pasting other people's test code from
> > GitHub.
> >
> > --
> >
> > As I said above, these are my personal $.02.  I admit that my comments
> > go a bit beyond the original question of bringing in contrib modules
> > -- I think implicitly the discussion about the contrib modules also
> > means "what do you need to provide a better and more well-rounded
> > experience", i.e. the question whether to have batteries included or
> > not. (As you may suspect I'm leaning towards including at least the
> > most important batteries, though what's really "important" at the
> > project level is of course up for debate.)
> >
> > On my side I'd be happy to help with those areas where I am able to
> > contribute, whether that's code and examples (like storm-starter) or
> > tutorials/docs (I already wrote e.g. [7] and [8]).
> >
> > Again, thanks Taylor for starting this discussion.  No matter the
> > actual outcome I'm sure the state of the project will be improved.
> >
> > Best,
> > Michael
> >
> >
> >
> > [1] https://github.com/elasticsearch/logstash
> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> > [3] https://github.com/miguno/puppet-storm
> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
> > [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> > [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> > [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> >
> >
> >
> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> >> Thanks for the feedback Bobby.
> >>
> >> To clarify, I'm mainly talking about spout/bolt/trident state
> >> implementations that integrate storm with *Technology X*, where
> >> *Technology X* is not a fundamental part of storm.
> >>
> >> Examples would be technologies that are part of or related to the
> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> >> Kafka, HDFS, HBase, Cassandra, etc.
> >>
> >> The idea behind having one or more Storm committers act as a
> >> "sponsor" is to make sure new additions are done carefully and with
> >> good reason. To add a new module, it would require committer/PPMC
> >> consensus, and assignment of one or more sponsors. Part of a
> >> sponsor's job would be to ensure that a module is maintained, which
> >> would require enough familiarity with the code to support it long
> >> term. If a new module was proposed, but no committers were willing
> >> to act as a sponsor, it would not be added.
> >>
> >> It would be the Committers'/PPMC's responsibility to make sure things
> >> don't get out of hand, and to do something about it if they do.
> >>
> >> Here's an old Hadoop JIRA thread [1] discussing the addition of
> >> Hive as a contrib module, similar to what happened with HBase as
> >> Bobby pointed out. Some interesting points are brought up. The
> >> difference here is that both HBase and Hive were pretty big
> >> codebases relative to Hadoop. With spout/bolt/state implementations
> >> I doubt we'd see anything along that scale.
> >>
> >> - Taylor
> >>
> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> >>
> >>
> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com
> >> <ma...@yahoo-inc.com>> wrote:
> >>
> >>> I can see a lot of value in having a distribution of storm that
> >>> comes with batteries included, everything is tested together and
> >>> you know it works.  But I don't see much long term developer
> >>> benefit in building them all together.  If there is strong
> >>> coupling between storm and these external projects so that they
> >>> break when storm changes then we need to understand the coupling
> >>> and decide if we want to reduce that coupling by stabilizing
> >>> APIs, improving version numbering and release process, etc.; or
> >>> if the functionality is something that should be offered as a
> >>> base service in storm.
> >>>
> >>> I can see politically the value of giving these other projects a
> >>> home in Apache, and making them sub-projects is the simplest
> >>> route to that. I'd love to have storm on yarn inside Apache.  I
> >>> just don't want to go overboard with it.  There was a time when
> >>> HBase was a "contrib" module under Hadoop along with a lot of
> >>> other things, and the Apache board came and told Hadoop to break
> >>> it up.
> >>>
> >>> Bringing storm-kafka into storm does not sound like it will solve
> >>> much from a developer's perspective, because there is at least as
> >>> much coupling with kafka as there is with storm.  I can see how
> >>> it is a huge amount of overhead and pain to set up a new project
> >>> just for a few hundred lines of code, as such I am in favor of
> >>> pulling in closely related projects, especially those that are
> >>> spouts and state implementations. I just want to be sure that we
> >>> do it carefully, with a good reason, and with enough people who
> >>> are familiar with the code to support it long term.
> >>>
> >>> If it starts to look like we are pulling in too many projects
> >>> perhaps we should look at something more like the bigtop project
> >>> https://bigtop.apache.org/ which produces a tested distribution
> >>> of Hadoop with many different sub-projects included in it.
> >>>
> >>> I am also a bit concerned about these sub-projects becoming
> >>> second class citizens, where we break something, but because the
> >>> build is off by default we don't know it.  I would prefer that
> >>> they are built and tested by default.  If the build and test time
> >>> starts to take too long, to me that means we need to start
> >>> wondering if we have too many contrib modules.
> >>>
> >>> --Bobby
> >>>
> >>> From: Brian Enochson <brian.enochson@gmail.com>
> >>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
> >>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> >>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> >>>
> >>> hi, I am in agreement with Taylor and believe I understand his
> >>> intent. An incredible tool/framework/application like Storm is
> >>> only enhanced and gains value from the number of well maintained
> >>> and vetted modules that can be used for integration and adding
> >>> further functionality. I am relatively new to the Storm community
> >>> but have spent quite some time reviewing contributing modules out
> >>> there, reviewing various duplicates and running into some version
> >>> incompatibilities. I understand the need to keep Storm itself
> >>> pure, but do think there needs to be some structure and
> >>> governance added to the contributing modules. Look at the benefit
> >>> a tool like npm brings to the node community. I like the idea of
> >>> sponsorship, vetting and a community vote.  I, as I'm sure many would
> >>> be, am willing to offer support and time to working through how
> >>> to set this up and helping with the implementation if it is
> >>> decided to pursue some solution. I hope these views are taken in
> >>> the spirit in which they are made, to make this incredible system even
> >>> better along with the surrounding eco-system.
> >>>
> >>> Thanks, Brian
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
> >>> <ptgoetz@gmail.com> wrote:
> >>>
> >>> Just to be clear (and play a little Devil's advocate :) ), I'm not
> >>> suggesting that whatever a "contrib" project/module/subproject
> >>> might become, be a clearinghouse for anything Storm-related.
> >>>
> >>> I see it as something that is well-vetted by the Storm
> >>> community, subject to PPMC review, vote, etc. Entry would require
> >>> community review, PPMC review, and in some cases ASF IP
> >>> clearance/legal review. Anything added would require some level
> >>> of commitment from the PPMC/committers to provide some level of
> >>> support.
> >>>
> >>> In other words, nothing "willy-nilly".
> >>>
> >>> One option could be that any module added require (X > 0)  number
> >>> of committers to volunteer as "sponsor"s for the module, and
> >>> commit to maintaining it.
> >>>
> >>> That being said, I don't see storm-kafka being any different
> >>> from anything else that provides integration points for Storm.
> >>>
> >>> -Taylor
> >>>
> >>>
> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com
> >>> <ma...@nathanmarz.com>>
> >>> wrote:
> >>>
> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
> >>> projects put these contrib modules in a "contrib" folder and keep
> >>> them managed as completely separate codebases. As it's not
> >>> actually a "module" necessary for Storm, there's an argument
> >>> there for doing it that way rather than via the multi-module
> >>> route.
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
> >>> <mpathira@umail.iu.edu> wrote:
> >>>
> >>> Hi Taylor,
> >>>
> >>> I'm +1 for pulling these external libraries into the Apache codebase.
> >>> This will certainly benefit the Storm community. I would also like to
> >>> contribute to this process.
> >>>
> >>> Thanks,
> >>> Milinda
> >>>
> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
> >>> <ptgoetz@gmail.com
> >>> <ma...@gmail.com>> wrote:
> >>>> A while back I opened STORM-206 [1] to capture ideas for
> >>>> pulling in "contrib" modules to the Apache codebase.
> >>>>
> >>>> In the past, we had the storm-contrib github project [2] which
> >>>> subsequently got broken up into individual projects hosted on
> >>>> the stormprocessor github group [3] and elsewhere.
> >>>>
> >>>> The problem with this approach is that in certain cases it led
> >>>> to code rot (modules not being updated in step with Storm's
> >>>> API), fragmentation (multiple similar modules with the same
> >>>> name), and confusion.
> >>>>
> >>>> A good example of this is the storm-kafka module [4], since it
> >>>> is a widely used component. Because storm-contrib wasn't being
> >>>> tagged in github, a lot of users had trouble reconciling with
> >>>> which versions of storm it was compatible. Some users built off
> >>>> specific commit hashes, some forked, and a few even pushed
> >>>> custom builds to repositories such as clojars. With kafka 0.8
> >>>> now available, there are two main storm-kafka projects, the
> >>>> original (compatible with kafka 0.7) and an updated fork [5]
> >>>> (compatible with kafka 0.8).
> >>>>
> >>>> My intention is not to find fault in any way, but rather to
> >>>> point out the resulting pain, and work toward a better
> >>>> solution.
> >>>>
> >>>> I think it would be beneficial to the Storm user community to
> >>>> have certain commonly used modules like storm-kafka brought
> >>>> into the Apache Storm project. Another benefit worth
> >>>> considering is the licensing/legal oversight that the ASF
> >>>> provides, which is important to many users.
> >>>>
> >>>> If this is something we want to do, then the big question
> >>>> becomes what sort of governance process needs to be established to
> >>>> ensure that such things are properly maintained.
> >>>>
> >>>> Some random thoughts, questions, etc. that jump to mind
> >>>> include:
> >>>>
> >>>> What to call these things: "contrib modules", "connectors",
> >>>> "integration modules", etc.?
> >>>>
> >>>> Build integration: I imagine they would be a multi-module
> >>>> submodule of the main maven build. Probably turned off by default
> >>>> and enabled by a maven profile.
> >>>>
> >>>> Governance: Have one or more committer volunteers responsible for
> >>>> maintenance, merging patches, etc.?
> >>>>
> >>>> Proposal process for pulling new modules?
> >>>>
> >>>>
> >>>> I look forward to hearing others' opinions.
> >>>>
> >>>> - Taylor
> >>>>
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
> >>>> [2] https://github.com/nathanmarz/storm-contrib
> >>>> [3] https://github.com/stormprocessor
> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> >
>



-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Nathan Marz <na...@nathanmarz.com>.

I don't like either name tbh. Storm itself is already broken into modules
(storm-core, storm-netty, etc) and things like storm-starter and
storm-kafka are something different. I don't like "connectors" because
something like storm-starter is not a connector. Maybe we call them
"extras"?

I would say we should support just Kafka 0.8.x.


On Wed, Mar 12, 2014 at 11:33 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> Incorporation of storm starter is underway.
>
> I'd like to turn the attention to kafka, with the goal being to pull in
> kafka support that is maintained and will be known to be compatible with
> the current version of storm and specific version(s) of kafka.
>
> I have the following questions for the community:
>
> 1. What do we want to call additions like this? I'm leaning toward
> "modules" or "connectors".
>
> 2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just
> 0.8.x? From a release management perspective, the latter is preferable
> because the 0.7.x line artifacts are not in maven central. This makes
> building a real pain, and maintaining support for two versions won't be
> fun. Also, most of the people I have worked with are looking at 0.8.x for a
> variety of reasons, but I'm open to either way.
>
> - Taylor
>
>
> > On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <michael+storm@michael-noll.com> wrote:
> >
> > Thanks for starting this discussion, Taylor.
> >
> > As a user of Storm (and a small-scale contributor to storm-starter) as
> > well as a user of Kafka, here are my $.02.
> >
> > [Storm and Kafka]
> > First, I agree with Nathan that storm-kafka should be considered to be
> > brought in.  While various "integrate Storm with X" options exist,
> > basically everyone I have been talking to is using Kafka in
> > combination with Storm.  I'm sure this is not a representative sample
> > of Storm users, and of course one may or may not agree that Kafka is
> > important enough of a technology in Storm's ecosystem.  Still, I do
> > see the need to make sure Storm and Kafka do work together without
> > having to go through forks of forks on GitHub and spending days to
> > figure out how to get data from Kafka (0.8) into Storm.
> >    Speaking of Kafka spout implementations, please don't forget
> > https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> > We've been quite happy with the former, so I'd suggest at least
> > considering both options here (maybe the two projects can even join
> > forces?).
> >
> > [Storm examples, storm-starter]
> > Second, IMHO every open source project should have a "1-click starting
> > experience" for new users.  That's very much related to the project
> > principles of tools like LogStash [1] who say: "Community: If a newbie
> > has a bad time, it's a bug."  For this reason I personally would like
> > to see the equivalent of storm-starter being brought into the "core"
> > Storm project -- think of an examples/ sub-module.  If the level of
> > effort is deemed too high to e.g. maintain what's already in
> > storm-starter, then (say) reduce the scope and remove some of the
> > examples.  In any case I would personally like to see bundled
> > examples that are known to work with the latest version of Storm.
> > storm-starter is often used to show new users how to get started with
> > Storm (I used that approach in my Storm blog posts, for instance, and
> > others like Mesosphere.io are even using storm-starter for their
> > commercial offerings [2]).
> >
> > [Have Storm up and running faster than you can brew an espresso]
> > Third, for the same reason (get people up and running in a few
> > minutes), I do like that other people in this thread have been
> > bringing up projects like storm-deploy.  For the same reason I have
> > open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> > few days ago, and I'll soon open source another Vagrant/Puppet based
> > tool that provides you with 1-click local and remote deployments of
> > Storm and Kafka clusters.  That's way better IMHO than having to
> > follow long articles or blog posts to deploy your first cluster.  And
> > there are a number of other people that have been rolling their own
> > variants.  Now don't get me wrong -- I don't mention this to pitch any
> > of those tools.  My intention is to say that it would be greatly
> > helpful to have /something/ like this for Storm, for the same reason
> > that it's nice to have LocalCluster for unit testing.  I have been
> > demo'ing both Storm and Kafka by launching clusters with a simple
> > command line, which always gets people excited.  If they can then rely
> > on existing examples (see above) to also /run/ an analysis on "their"
> > cluster then they have a beautiful start.
> >    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
> > VM cluster setup, too [4] so that people can run the Aurora tutorial
> > on their machines in a few minutes.
> >
> > [Storm and YARN]
> > Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> > would be nice.  It ties into being able to run LocalCluster as well as
> > to run Storm in local or remote VMs -- but now alongside your existing
> > Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
> > will surely be similarly attractive.
> >
> >
> > On a related note bringing the Storm docs up to speed with the quality
> > of the Storm code would also be great.  I have seen that since Storm
> > moved to Incubator several new sections have been added such as the
> > FAQ [5] (btw: nice!).
> >
> > Similarly, there should be better examples and docs for users how to
> > write unit tests for Storm.  Right now people seem to be cobbling
> > together their test code by figuring out how the 1-year old code in
> > [6] actually works, and copy-pasting other people's test code from GitHub.
> >
> > --
> >
> > As I said above, these are my personal $.02.  I admit that my comments
> > go a bit beyond the original question of bringing in contrib modules
> > -- I think implicitly the discussion about the contrib modules also
> > means "what do you need to provide a better and more well-rounded
> > experience", i.e. the question whether to have batteries included or
> > not. (As you may suspect I'm leaning towards including at least the
> > most important batteries, though what's really "important" at the
> > project level is of course up for debate.)
> >
> > On my side I'd be happy to help with those areas where I am able to
> > contribute, whether that's code and examples (like storm-starter) or
> > tutorials/docs (I already wrote e.g. [7] and [8]).
> >
> > Again, thanks Taylor for starting this discussion.  No matter the
> > actual outcome I'm sure the state of the project will be improved.
> >
> > Best,
> > Michael
> >
> >
> >
> > [1] https://github.com/elasticsearch/logstash
> > [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> > [3] https://github.com/miguno/puppet-storm
> > [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> > [5] http://storm.incubator.apache.org/documentation/FAQ.html
> > [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> > [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> > [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> >
> >
> >
> >> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> >> Thanks for the feedback Bobby.
> >>
> >> To clarify, I'm mainly talking about spout/bolt/trident state
> >> implementations that integrate storm with *Technology X*, where
> >> *Technology X* is not a fundamental part of storm.
> >>
> >> Examples would be technologies that are part of or related to the
> >> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> >> Kafka, HDFS, HBase, Cassandra, etc.
> >>
> >> The idea behind having one or more Storm committers act as a
> >> "sponsor" is to make sure new additions are done carefully and with
> >> good reason. To add a new module, it would require committer/PPMC
> >> consensus, and assignment of one or more sponsors. Part of a
> >> sponsor's job would be to ensure that a module is maintained, which
> >> would require enough familiarity with the code to support it long
> >> term. If a new module was proposed, but no committers were willing
> >> to act as a sponsor, it would not be added.
> >>
> >> It would be the Committers'/PPMC's responsibility to make sure things
> >> didn't get out of hand, and to do something about it if they did.
> >>
> >> Here's an old Hadoop JIRA thread [1] discussing the addition of
> >> Hive as a contrib module, similar to what happened with HBase as
> >> Bobby pointed out. Some interesting points are brought up. The
> >> difference here is that both HBase and Hive were pretty big
> >> codebases relative to Hadoop. With spout/bolt/state implementations
> >> I doubt we'd see anything along that scale.
> >>
> >> - Taylor
> >>
> >> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> >>
> >>
> >> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
> >>
> >>> I can see a lot of value in having a distribution of storm that
> >>> comes with batteries included, everything is tested together and
> >>> you know it works.  But I don't see much long term developer
> >>> benefit in building them all together.  If there is strong
> >>> coupling between storm and these external projects so that they
> >>> break when storm changes then we need to understand the coupling
> >>> and decide if we want to reduce that coupling by stabilizing
> >>> APIs, improving version numbering and release process, etc.; or
> >>> if the functionality is something that should be offered as a
> >>> base service in storm.
> >>>
> >>> I can see politically the value of giving these other projects a
> >>> home in Apache, and making them sub-projects is the simplest
> >>> route to that. I'd love to have storm on yarn inside Apache.  I
> >>> just don't want to go overboard with it.  There was a time when
> >>> HBase was a "contrib" module under Hadoop along with a lot of
> >>> other things, and the Apache board came and told Hadoop to break
> >>> it up.
> >>>
> >>> Bringing storm-kafka into storm does not sound like it will solve
> >>> much from a developer's perspective, because there is at least as
> >>> much coupling with kafka as there is with storm.  I can see how
> >>> it is a huge amount of overhead and pain to set up a new project
> >>> just for a few hundred lines of code, as such I am in favor of
> >>> pulling in closely related projects, especially those that are
> >>> spouts and state implementations. I just want to be sure that we
> >>> do it carefully, with a good reason, and with enough people who
> >>> are familiar with the code to support it long term.
> >>>
> >>> If it starts to look like we are pulling in too many projects
> >>> perhaps we should look at something more like the bigtop project
> >>> https://bigtop.apache.org/ which produces a tested distribution
> >>> of Hadoop with many different sub-projects included in it.
> >>>
> >>> I am also a bit concerned about these sub-projects becoming
> >>> second class citizens, where we break something, but because the
> >>> build is off by default we don't know it.  I would prefer that
> >>> they are built and tested by default.  If the build and test time
> >>> starts to take too long, to me that means we need to start
> >>> wondering if we have too many contrib modules.
> >>>
> >>> --Bobby
> >>>
> >>> From: Brian Enochson <brian.enochson@gmail.com>
> >>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> >>> Date: Tuesday, February 25, 2014 at 9:50 PM
> >>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> >>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
> >>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> >>>
> >>> hi, I am in agreement with Taylor and believe I understand his
> >>> intent. An incredible tool/framework/application like Storm is
> >>> only enhanced and gains value from the number of well maintained
> >>> and vetted modules that can be used for integration and adding
> >>> further functionality. I am relatively new to the Storm community
> >>> but have spent quite some time reviewing contributing modules out
> >>> there, reviewing various duplicates and running into some version
> >>> incompatibilities. I understand the need to keep Storm itself
> >>> pure, but do think there needs to be some structure and
> >>> governance added to the contributing modules. Look at the benefit
> >>> a tool like npm brings to the node community. I like the idea of
> >>> sponsorship, vetting and a community vote.  I, as I'm sure many would
> >>> be, am willing to offer support and time to working through how
> >>> to set this up and helping with the implementation if it is
> >>> decided to pursue some solution. I hope these views are taken in
> >>> the spirit in which they are made, to make this incredible system even
> >>> better along with the surrounding eco-system.
> >>>
> >>> Thanks, Brian
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> >>>
> >>> Just to be clear (and play a little Devil's advocate :) ), I'm not
> >>> suggesting that whatever a "contrib" project/module/subproject
> >>> might become, be a clearinghouse for anything Storm-related.
> >>>
> >>> I see it as something that is well-vetted by the Storm
> >>> community, subject to PPMC review, vote, etc. Entry would require
> >>> community review, PPMC review, and in some cases ASF IP
> >>> clearance/legal review. Anything added would require some level
> >>> of commitment from the PPMC/committers to provide some level of
> >>> support.
> >>>
> >>> In other words, nothing "willy-nilly".
> >>>
> >>> One option could be that any module added require (X > 0)  number
> >>> of committers to volunteer as "sponsor"s for the module, and
> >>> commit to maintaining it.
> >>>
> >>> That being said, I don't see storm-kafka being any different
> >>> from anything else that provides integration points for Storm.
> >>>
> >>> -Taylor
> >>>
> >>>
> >>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
> >>>
> >>> I'm only +1 for pulling in storm-kafka and updating it. Other
> >>> projects put these contrib modules in a "contrib" folder and keep
> >>> them managed as completely separate codebases. As it's not
> >>> actually a "module" necessary for Storm, there's an argument
> >>> there for doing it that way rather than via the multi-module
> >>> route.
> >>>
> >>>
> >>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
> >>>
> >>> Hi Taylor,
> >>>
> >>> I'm +1 for pulling these external libraries into the Apache codebase.
> >>> This will certainly benefit the Storm community. I would also like to
> >>> contribute to this process.
> >>>
> >>> Thanks,
> >>> Milinda
> >>>
> >>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> >>>> A while back I opened STORM-206 [1] to capture ideas for
> >>>> pulling in "contrib" modules to the Apache codebase.
> >>>>
> >>>> In the past, we had the storm-contrib github project [2] which
> >>>> subsequently got broken up into individual projects hosted on
> >>>> the stormprocessor github group [3] and elsewhere.
> >>>>
> >>>> The problem with this approach is that in certain cases it led
> >>>> to code rot (modules not being updated in step with Storm's
> >>>> API), fragmentation (multiple similar modules with the same
> >>>> name), and confusion.
> >>>>
> >>>> A good example of this is the storm-kafka module [4], since it
> >>>> is a widely used component. Because storm-contrib wasn't being
> >>>> tagged in github, a lot of users had trouble reconciling with
> >>>> which versions of storm it was compatible. Some users built off
> >>>> specific commit hashes, some forked, and a few even pushed
> >>>> custom builds to repositories such as clojars. With kafka 0.8
> >>>> now available, there are two main storm-kafka projects, the
> >>>> original (compatible with kafka 0.7) and an updated fork [5]
> >>>> (compatible with kafka 0.8).
> >>>>
> >>>> My intention is not to find fault in any way, but rather to
> >>>> point out the resulting pain, and work toward a better
> >>>> solution.
> >>>>
> >>>> I think it would be beneficial to the Storm user community to
> >>>> have certain commonly used modules like storm-kafka brought
> >>>> into the Apache Storm project. Another benefit worth
> >>>> considering is the licensing/legal oversight that the ASF
> >>>> provides, which is important to many users.
> >>>>
> >>>> If this is something we want to do, then the big question
> >>>> becomes what sort of governance process needs to be established to
> >>>> ensure that such things are properly maintained.
> >>>>
> >>>> Some random thoughts, questions, etc. that jump to mind
> >>>> include:
> >>>>
> >>>> What to call these things: "contrib modules", "connectors",
> >>>> "integration modules", etc.?
> >>>>
> >>>> Build integration: I imagine they would be a multi-module submodule
> >>>> of the main maven build. Probably turned off by default and enabled
> >>>> by a maven profile.
> >>>>
> >>>> Governance: Have one or more committer volunteers responsible for
> >>>> maintenance, merging patches, etc.?
> >>>>
> >>>> Proposal process for pulling new modules?
> >>>>
> >>>>
> >>>> I look forward to hearing others' opinions.
> >>>>
> >>>> - Taylor
> >>>>
> >>>>
> >>>> [1] https://issues.apache.org/jira/browse/STORM-206
> >>>> [2] https://github.com/nathanmarz/storm-contrib
> >>>> [3] https://github.com/stormprocessor
> >>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> >>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> >
>



-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Incorporation of storm starter is underway.

I'd like to turn the attention to kafka, with the goal being to pull in kafka support that is maintained and will be known to be compatible with the current version of storm and specific version(s) of kafka.

I have the following questions for the community:

1. What do we want to call additions like this? I'm leaning toward "modules" or "connectors".

2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just 0.8.x? From a release management perspective, the latter is preferable because the 0.7.x line artifacts are not in maven central. This makes building a real pain, and maintaining support for two versions won't be fun. Also, most of the people I have worked with are looking at 0.8.x for a variety of reasons, but I'm open to either way.

- Taylor


> On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <mi...@michael-noll.com> wrote:
> 
> Thanks for starting this discussion, Taylor.
> 
> As a user of Storm (and a small-scale contributor to storm-starter) as
> well as a user of Kafka, here are my $.02.
> 
> [Storm and Kafka]
> First, I agree with Nathan that storm-kafka should be considered to be
> brought in.  While various "integrate Storm with X" options exist,
> basically everyone I have been talking to is using Kafka in
> combination with Storm.  I'm sure this is not a representative sample
> of Storm users, and of course one may or may not agree that Kafka is
> important enough of a technology in Storm's ecosystem.  Still, I do
> see the need to make sure Storm and Kafka do work together without
> having to go through forks of forks on GitHub and spending days to
> figure out how to get data from Kafka (0.8) into Storm.
>    Speaking of Kafka spout implementations, please don't forget
> https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> We've been quite happy with the former, so I'd suggest to at least
> consider both options here (maybe the two projects can even join forces?).
> 
> [Storm examples, storm-starter]
> Second, IMHO every open source project should have a "1-click starting
> experience" for new users.  That's very much related to the project
> principles of tools like LogStash [1] who say: "Community: If a newbie
> has a bad time, it's a bug."  For this reason I personally would like
> to see the equivalent of storm-starter being brought into the "core"
> Storm project -- think of an examples/ sub-module.  If the level of
> effort is deemed too high to e.g. maintain what's already in
> storm-starter, then (say) reduce the scope and remove some of the
> examples.  In any case I would personally like to see bundled
> examples that are known to work with the latest version of Storm.
> storm-starter is often used to show new users how to get started with
> Storm (I used that approach in my Storm blog posts, for instance, and
> others like Mesosphere.io are even using storm-starter for their
> commercial offerings [2]).
> 
> [Have Storm up and running faster than you can brew an espresso]
> Third, for the same reason (get people up and running in a few
> minutes), I do like that other people in this thread have been
> bringing up projects like storm-deploy.  For the same reason I have
> open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> few days ago, and I'll soon open source another Vagrant/Puppet based
> tool that provides you with 1-click local and remote deployments of
> Storm and Kafka clusters.  That's way better IMHO than having to
> follow long articles or blog posts to deploy your first cluster.  And
> there are a number of other people that have been rolling their own
> variants.  Now don't get me wrong -- I don't mention this to pitch any
> of those tools.  My intention is to say that it would be greatly
> helpful to have /something/ like this for Storm, for the same reason
> that it's nice to have LocalCluster for unit testing.  I have been
> demo'ing both Storm and Kafka by launching clusters with a simple
> command line, which always gets people excited.  If they can then rely
> on existing examples (see above) to also /run/ an analysis on "their"
> cluster then they have a beautiful start.
>    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
> VM cluster setup, too [4] so that people can run the Aurora tutorial
> on their machines in a few minutes.
> 
> [Storm and YARN]
> Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> would be nice.  It ties into being able to run LocalCluster as well as
> to run Storm in local or remote VMs -- but now alongside your existing
> Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
> will surely be similarly attractive.
> 
> 
> On a related note bringing the Storm docs up to speed with the quality
> of the Storm code would also be great.  I have seen that since Storm
> moved to Incubator several new sections have been added such as the
> FAQ [5] (btw: nice!).
> 
> Similarly, there should be better examples and docs for users how to
> write unit tests for Storm.  Right now people seem to be cobbling
> together their test code by figuring out how the 1-year old code in
> [6] actually works, and copy-pasting other people's test code from GitHub.
> 
> --
> 
> As I said above, these are my personal $.02.  I admit that my comments
> go a bit beyond the original question of bringing in contrib modules
> -- I think implicitly the discussion about the contrib modules also
> means "what do you need to provide a better and more well-rounded
> experience", i.e. the question whether to have batteries included or
> not. (As you may suspect I'm leaning towards including at least the
> most important batteries, though what's really "important" at the
> project level is of course up for debate.)
> 
> On my side I'd be happy to help with those areas where I am able to
> contribute, whether that's code and examples (like storm-starter) or
> tutorials/docs (I already wrote e.g. [7] and [8]).
> 
> Again, thanks Taylor for starting this discussion.  No matter the
> actual outcome I'm sure the state of the project will be improved.
> 
> Best,
> Michael
> 
> 
> 
> [1] https://github.com/elasticsearch/logstash
> [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> [3] https://github.com/miguno/puppet-storm
> [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> [5] http://storm.incubator.apache.org/documentation/FAQ.html
> [6] https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> [7] https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> [8] http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> 
> 
> 
>> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> Thanks for the feedback Bobby.
>> 
>> To clarify, I’m mainly talking about spout/bolt/trident state 
>> implementations that integrate storm with *Technology X*, where 
>> *Technology X* is not a fundamental part of storm.
>> 
>> Examples would be technologies that are part of or related to the 
>> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>> Kafka, HDFS, HBase, Cassandra, etc.
>> 
>> The idea behind having one or more Storm committers act as a
>> “sponsor” is to make sure new additions are done carefully and with
>> good reason. To add a new module, it would require committer/PPMC
>> consensus, and assignment of one or more sponsors. Part of a
>> sponsor’s job would be to ensure that a module is maintained, which
>> would require enough familiarity with the code to support it long
>> term. If a new module was proposed, but no committers were willing
>> to act as a sponsor, it would not be added.
>> 
>> It would be the Committers’/PPMC’s responsibility to make sure things
>> didn’t get out of hand, and to do something about it if they did.
>> 
>> Here’s an old Hadoop JIRA thread [1] discussing the addition of
>> Hive as a contrib module, similar to what happened with HBase as
>> Bobby pointed out. Some interesting points are brought up. The
>> difference here is that both HBase and Hive were pretty big
>> codebases relative to Hadoop. With spout/bolt/state implementations
>> I doubt we’d see anything along that scale.
>> 
>> - Taylor
>> 
>> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> 
>> 
>> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>> 
>>> I can see a lot of value in having a distribution of storm that
>>> comes with batteries included, everything is tested together and
>>> you know it works.  But I don’t see much long term developer
>>> benefit in building them all together.  If there is strong
>>> coupling between storm and these external projects so that they
>>> break when storm changes then we need to understand the coupling
>>> and decide if we want to reduce that coupling by stabilizing
>>> APIs, improving version numbering and release process, etc.; or
>>> if the functionality is something that should be offered as a
>>> base service in storm.
>>> 
>>> I can see politically the value of giving these other projects a
>>> home in Apache, and making them sub-projects is the simplest
>>> route to that. I’d love to have storm on yarn inside Apache.  I
>>> just don’t want to go overboard with it.  There was a time when
>>> HBase was a “contrib” module under Hadoop along with a lot of
>>> other things, and the Apache board came and told Hadoop to break
>>> it up.
>>> 
>>> Bringing storm-kafka into storm does not sound like it will solve
>>> much from a developer’s perspective, because there is at least as
>>> much coupling with kafka as there is with storm.  I can see how
>>> it is a huge amount of overhead and pain to set up a new project
>>> just for a few hundred lines of code, as such I am in favor of
>>> pulling in closely related projects, especially those that are
>>> spouts and state implementations. I just want to be sure that we
>>> do it carefully, with a good reason, and with enough people who
>>> are familiar with the code to support it long term.
>>> 
>>> If it starts to look like we are pulling in too many projects
>>> perhaps we should look at something more like the bigtop project 
>>> https://bigtop.apache.org/ which produces a tested distribution
>>> of Hadoop with many different sub-projects included in it.
>>> 
>>> I am also a bit concerned about these sub-projects becoming
>>> second class citizens, where we break something, but because the
>>> build is off by default we don’t know it.  I would prefer that
>>> they are built and tested by default.  If the build and test time
>>> starts to take too long, to me that means we need to start
>>> wondering if we have too many contrib modules.
>>> 
>>> —Bobby
>>> 
>>> From: Brian Enochson <brian.enochson@gmail.com>
>>> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>> Date: Tuesday, February 25, 2014 at 9:50 PM
>>> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>>> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>>> 
>>> hi, I am in agreement with Taylor and believe I understand his
>>> intent. An incredible tool/framework/application like Storm is
>>> only enhanced and gains value from the number of well maintained
>>> and vetted modules that can be used for integration and adding
>>> further functionality. I am relatively new to the Storm community
>>> but have spent quite some time reviewing contributing modules out
>>> there, reviewing various duplicates and running into some version
>>> incompatibilities. I understand the need to keep Storm itself
>>> pure, but do think there needs to be some structure and
>>> governance added to the contributing modules. Look at the benefit
>>> a tool like npm brings to the node community. I like the idea of
>>> sponsorship, vetting and a community vote.  I, as I'm sure many would
>>> be, am willing to offer support and time to working through how
>>> to set this up and helping with the implementation if it is
>>> decided to pursue some solution. I hope these views are taken in
>>> the spirit in which they are made, to make this incredible system even
>>> better along with the surrounding eco-system.
>>> 
>>> Thanks, Brian
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>>
>>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
>>> suggesting that whatever a “contrib” project/module/subproject
>>> might become, be a clearinghouse for anything Storm-related.
>>> 
>>> I see it as something that is well-vetted by the Storm
>>> community, subject to PPMC review, vote, etc. Entry would require
>>> community review, PPMC review, and in some cases ASF IP
>>> clearance/legal review. Anything added would require some level
>>> of commitment from the PPMC/committers to provide some level of
>>> support.
>>> 
>>> In other words, nothing “willy-nilly”.
>>> 
>>> One option could be that any module added require (X > 0)  number
>>> of committers to volunteer as “sponsor”s for the module, and
>>> commit to maintaining it.
>>> 
>>> That being said, I don’t see storm-kafka being any different
>>> from anything else that provides integration points for Storm.
>>> 
>>> -Taylor
>>> 
>>> 
>>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>>> 
>>> I'm only +1 for pulling in storm-kafka and updating it. Other
>>> projects put these contrib modules in a "contrib" folder and keep
>>> them managed as completely separate codebases. As it's not
>>> actually a "module" necessary for Storm, there's an argument
>>> there for doing it that way rather than via the multi-module
>>> route.
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
>>>
>>> Hi Taylor,
>>>
>>> I'm +1 for pulling these external libraries into the Apache codebase.
>>> This will certainly benefit the Storm community. I would also like to
>>> contribute to this process.
>>>
>>> Thanks,
>>> Milinda
>>> 
>>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
>>>> A while back I opened STORM-206 [1] to capture ideas for
>>>> pulling in "contrib" modules to the Apache codebase.
>>>> 
>>>> In the past, we had the storm-contrib github project [2] which 
>>>> subsequently got broken up into individual projects hosted on
>>>> the stormprocessor github group [3] and elsewhere.
>>>> 
>>>> The problem with this approach is that in certain cases it led
>>>> to code rot (modules not being updated in step with Storm's
>>>> API), fragmentation (multiple similar modules with the same
>>>> name), and confusion.
>>>> 
>>>> A good example of this is the storm-kafka module [4], since it
>>>> is a widely used component. Because storm-contrib wasn't being
>>>> tagged in github, a lot of users had trouble reconciling with
>>>> which versions of storm it was compatible. Some users built off
>>>> specific commit hashes, some forked, and a few even pushed
>>>> custom builds to repositories such as clojars. With kafka 0.8
>>>> now available, there are two main storm-kafka projects, the
>>>> original (compatible with kafka 0.7) and an updated fork [5]
>>>> (compatible with kafka 0.8).
>>>> 
>>>> My intention is not to find fault in any way, but rather to
>>>> point out the resulting pain, and work toward a better
>>>> solution.
>>>> 
>>>> I think it would be beneficial to the Storm user community to
>>>> have certain commonly used modules like storm-kafka brought
>>>> into the Apache Storm project. Another benefit worth
>>>> considering is the licensing/legal oversight that the ASF
>>>> provides, which is important to many users.
>>>> 
>>>> If this is something we want to do, then the big question
>>>> becomes what sort of governance process needs to be established to
>>>> ensure that such things are properly maintained.
>>>> 
>>>> Some random thoughts, questions, etc. that jump to mind
>>>> include:
>>>> 
>>>> What to call these things: "contrib modules", "connectors",
>>>> "integration modules", etc.?
>>>> Build integration: I imagine they would be a multi-module
>>>> submodule of the main maven build. Probably turned off by
>>>> default and enabled by a maven profile.
>>>> Governance: Have one or more committer volunteers responsible
>>>> for maintenance, merging patches, etc.? Proposal process for
>>>> pulling new modules?
>>>> 
>>>> 
>>>> I look forward to hearing others' opinions.
>>>> 
>>>> - Taylor
>>>> 
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>>> [2] https://github.com/nathanmarz/storm-contrib
>>>> [3] https://github.com/stormprocessor
>>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Incorporation of storm starter is underway.

I'd like to turn our attention to kafka, with the goal of pulling in kafka support that is maintained and known to be compatible with the current version of storm and specific version(s) of kafka.

I have the following questions for the community:

1. What do we want to call additions like this? I'm leaning toward "modules" or "connectors".

2. Do we want to support both 0.7.x and 0.8.x versions of kafka, or just 0.8.x? From a release management perspective, the latter is preferable because the 0.7.x line artifacts are not in maven central. This makes building a real pain, and maintaining support for two versions won't be fun. Also, most of the people I have worked with are looking at 0.8.x for a variety of reasons, but I'm open to either approach.

- Taylor


> On Mar 1, 2014, at 5:11 AM, "Michael G. Noll" <mi...@michael-noll.com> wrote:
> 
> Thanks for starting this discussion, Taylor.
> 
> As a user of Storm (and a small-scale contributor to storm-starter) as
> well as a user of Kafka, here are my $.02.
> 
> [Storm and Kafka]
> First, I agree with Nathan that storm-kafka should be considered to be
> brought in.  While various "integrate Storm with X" options exist,
> basically everyone I have been talking to is using Kafka in
> combination with Storm.  I'm sure this is not a representative sample
> of Storm users, and of course one may or may not agree that Kafka is
> important enough of a technology in Storm's ecosystem.  Still, I do
> see the need to make sure Storm and Kafka do work together without
> having to go through forks of forks on GitHub and spending days to
> figure out how to get data from Kafka (0.8) into Storm.
>    Speaking of Kafka spout implementations, please don't forget
> https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> We've been quite happy with the former, so I'd suggest to at least
> consider both options here (maybe the two projects can even join forces?).
> 
> [Storm examples, storm-starter]
> Second, IMHO every open source project should have a "1-click starting
> experience" for new users.  That's very much related to the project
> principles of tools like LogStash [1] who say: "Community: If a newbie
> has a bad time, it's a bug."  For this reason I personally would like
> to see the equivalent of storm-starter being brought into the "core"
> Storm project -- think of an examples/ sub-module.  If the level of
> effort is deemed too high to e.g. maintain what's already in
> storm-starter, then (say) reduce the scope and remove some of the
> examples.  In any case I'd personally like to see bundled
> examples that are known to work with the latest version of Storm.
> storm-starter is often used to show new users how to get started with
> Storm (I used that approach in my Storm blog posts, for instance, and
> others like Mesosphere.io are even using storm-starter for their
> commercial offerings [2]).
> 
> [Have Storm up and running faster than you can brew an espresso]
> Third, for the same reason (get people up and running in a few
> minutes), I do like that other people in this thread have been
> bringing up projects like storm-deploy.  For the same reason I have
> open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> few days ago, and I'll soon open source another Vagrant/Puppet based
> tool that provides you with 1-click local and remote deployments of
> Storm and Kafka clusters.  That's way better IMHO than having to
> follow long articles or blog posts to deploy your first cluster.  And
> there are a number of other people that have been rolling their own
> variants.  Now don't get me wrong -- I don't mention this to pitch any
> of those tools.  My intention is to say that it would be greatly
> helpful to have /something/ like this for Storm, for the same reason
> that it's nice to have LocalCluster for unit testing.  I have been
> demo'ing both Storm and Kafka by launching clusters with a simple
> command line, which always gets people excited.  If they can then rely
> on existing examples (see above) to also /run/ an analysis on "their"
> cluster then they have a beautiful start.
>    Oh, and btw:  Apache Aurora (with Mesos) have such a Vagrant-based
> VM cluster setup, too [4] so that people can run the Aurora tutorial
> on their machines in a few minutes.
> 
> [Storm and YARN]
> Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> would be nice.  It ties into being able to run LocalCluster as well as
> to run Storm in local or remote VMs -- but now alongside your existing
> Hadoop/YARN infrastructure.  For those preferring Mesos Storm-on-Mesos
> will surely be similarly attractive.
> 
> 
> On a related note bringing the Storm docs up to speed with the quality
> of the Storm code would also be great.  I have seen that since Storm
> moved to Incubator several new sections have been added such as the
> FAQ [5] (btw: nice!).
> 
> Similarly, there should be better examples and docs for users how to
> write unit tests for Storm.  Right now people seem to be cobbling
> together their test code by figuring out how the 1-year old code in
> [6] actually works, and copy-pasting other people's test code from GitHub.
> 
> --
> 
> As I said above, these are my personal $.02.  I admit that my comments
> go a bit beyond the original question of bringing in contrib modules
> -- I think implicitly the discussion about the contrib modules also
> means "what do you need to provide a better and more well-rounded
> experience", i.e. the question whether to have batteries included or
> not. (As you may suspect I'm leaning towards including at least the
> most important batteries, though what's really "important" on the
> project level is of course up for debate.)
> 
> On my side I'd be happy to help with those areas where I am able to
> contribute, whether that's code and examples (like storm-starter) or
> tutorials/docs (I already wrote e.g. [7] and [8]).
> 
> Again, thanks Taylor for starting this discussion.  No matter the
> actual outcome I'm sure the state of the project will be improved.
> 
> Best,
> Michael
> 
> 
> 
> [1] https://github.com/elasticsearch/logstash
> [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> [3] https://github.com/miguno/puppet-storm
> [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> [5] http://storm.incubator.apache.org/documentation/FAQ.html
> [6]
> https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> [7]
> https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> [8]
> http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> 
> 
> 
>> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> Thanks for the feedback Bobby.
>> 
>> To clarify, I’m mainly talking about spout/bolt/trident state 
>> implementations that integrate storm with *Technology X*, where 
>> *Technology X* is not a fundamental part of storm.
>> 
>> Examples would be technologies that are part of or related to the 
>> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: 
>> Kafka, HDFS, HBase, Cassandra, etc.
>> 
>> The idea behind having one or more Storm committers act as a
>> “sponsor” is to make sure new additions are done carefully and with
>> good reason. To add a new module, it would require committer/PPMC
>> consensus, and assignment of one or more sponsors. Part of a
>> sponsor’s job would be to ensure that a module is maintained, which
>> would require enough familiarity with the code to support it long
>> term. If a new module was proposed, but no committers were willing
>> to act as a sponsor, it would not be added.
>> 
>> It would be the Committers’/PPMC’s responsibility to make sure things 
>> don’t get out of hand, and to do something about it if they do.
>> 
>> Here’s an old Hadoop JIRA thread [1] discussing the addition of
>> Hive as a contrib module, similar to what happened with HBase as
>> Bobby pointed out. Some interesting points are brought up. The
>> difference here is that both HBase and Hive were pretty big
>> codebases relative to Hadoop. With spout/bolt/state implementations
>> I doubt we’d see anything along that scale.
>> 
>> - Taylor
>> 
>> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> 
>> 
>> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com 
>> <ma...@yahoo-inc.com>> wrote:
>> 
>>> I can see a lot of value in having a distribution of storm that
>>> comes with batteries included, everything is tested together and
>>> you know it works.  But I don’t see much long term developer
>>> benefit in building them all together.  If there is strong
>>> coupling between storm and these external projects so that they
>>> break when storm changes then we need to understand the coupling
>>> and decide if we want to reduce that coupling by stabilizing
>>> APIs, improving version numbering and release process, etc.; or
>>> if the functionality is something that should be offered as a
>>> base service in storm.
>>> 
>>> I can see politically the value of giving these other projects a
>>> home in Apache, and making them sub-projects is the simplest
>>> route to that. I’d love to have storm on yarn inside Apache.  I
>>> just don’t want to go overboard with it.  There was a time when
>>> HBase was a “contrib” module under Hadoop along with a lot of
>>> other things, and the Apache board came and told Hadoop to break
>>> it up.
>>> 
>>> Bringing storm-kafka into storm does not sound like it will solve
>>> much from a developer’s perspective, because there is at least as
>>> much coupling with kafka as there is with storm.  I can see how
>>> it is a huge amount of overhead and pain to set up a new project
>>> just for a few hundred lines of code, as such I am in favor of
>>> pulling in closely related projects, especially those that are
>>> spouts and state implementations. I just want to be sure that we
>>> do it carefully, with a good reason, and with enough people who
>>> are familiar with the code to support it long term.
>>> 
>>> If it starts to look like we are pulling in too many projects
>>> perhaps we should look at something more like the bigtop project 
>>> https://bigtop.apache.org/ which produces a tested distribution
>>> of Hadoop with many different sub-projects included in it.
>>> 
>>> I am also a bit concerned about these sub-projects becoming
>>> second class citizens, where we break something, but because the
>>> build is off by default we don’t know it.  I would prefer that
>>> they are built and tested by default.  If the build and test time
>>> starts to take too long, to me that means we need to start
>>> wondering if we have too many contrib modules.
>>> 
>>> —Bobby
>>> 
>>> From: Brian Enochson <brian.enochson@gmail.com 
>>> <ma...@gmail.com>>
>>> Reply-To: "user@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>"
>>> <user@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>>
>>> Date: Tuesday, February 25, 2014 at 9:50 PM
>>> To: "user@storm.incubator.apache.org 
>>> <ma...@storm.incubator.apache.org>"
>>> <user@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>>
>>> Cc: "dev@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>"
>>> <dev@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>>
>>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>>> 
>>> hi, I am in agreement with Taylor and believe I understand his
>>> intent. An incredible tool/framework/application like Storm is
>>> only enhanced and gains value from the number of well maintained
>>> and vetted modules that can be used for integration and adding
>>> further functionality. I am relatively new to the Storm community
>>> but have spent quite some time reviewing contributing modules out
>>> there, reviewing various duplicates and running into some version
>>> incompatibilities. I understand the need to keep Storm itself
>>> pure, but do think there needs to be some structure and
>>> governance added to the contributing modules. Look at the benefit
>>> a tool like npm brings to the node community. I like the idea of
>>> sponsorship, vetting and a community vote.  I, as I'm sure many would
>>> be, am willing to offer support and time to working through how
>>> to set this up and helping with the implementation if it is
>>> decided to pursue some solution. I hope these views are taken in
>>> the spirit they are made, to make this incredible system even
>>> better along with the surrounding eco-system.
>>> 
>>> Thanks, Brian
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>>> <ptgoetz@gmail.com 
>>> <ma...@gmail.com>> wrote: Just
>>> to be clear (and play a little Devil’s advocate :) ), I’m not 
>>> suggesting that whatever a “contrib” project/module/subproject
>>> might become, be a clearinghouse for anything Storm-related.
>>> 
>>> I see it as something that is well-vetted by the Storm
>>> community, subject to PPMC review, vote, etc. Entry would require
>>> community review, PPMC review, and in some cases ASF IP
>>> clearance/legal review. Anything added would require some level
>>> of commitment from the PPMC/committers to provide some level of
>>> support.
>>> 
>>> In other words, nothing “willy-nilly”.
>>> 
>>> One option could be that any module added require (X > 0)  number
>>> of committers to volunteer as “sponsor”s for the module, and
>>> commit to maintaining it.
>>> 
>>> That being said, I don’t see storm-kafka being any different
>>> from anything else that provides integration points for Storm.
>>> 
>>> -Taylor
>>> 
>>> 
>>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com 
>>> <ma...@nathanmarz.com>>
>>> wrote:
>>> 
>>> I'm only +1 for pulling in storm-kafka and updating it. Other
>>> projects put these contrib modules in a "contrib" folder and keep
>>> them managed as completely separate codebases. As it's not
>>> actually a "module" necessary for Storm, there's an argument
>>> there for doing it that way rather than via the multi-module
>>> route.
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage 
>>> <mpathira@umail.iu.edu 
>>> <ma...@umail.iu.edu>>
>>> wrote: Hi Taylor,
>>> 
>>> I'm +1 for pulling these external libraries into Apache codebase.
>>> This will certainly benefit the Storm community. I would also like to
>>> contribute to this process.
>>> 
>>> Thanks Milinda
>>> 
>>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>>> <ptgoetz@gmail.com 
>>> <ma...@gmail.com>> wrote:
>>>> A while back I opened STORM-206 [1] to capture ideas for
>>>> pulling in "contrib" modules to the Apache codebase.
>>>> 
>>>> In the past, we had the storm-contrib github project [2] which 
>>>> subsequently got broken up into individual projects hosted on
>>>> the stormprocessor github group [3] and elsewhere.
>>>> 
>>>> The problem with this approach is that in certain cases it led
>>>> to code rot (modules not being updated in step with Storm's
>>>> API), fragmentation (multiple similar modules with the same
>>>> name), and confusion.
>>>> 
>>>> A good example of this is the storm-kafka module [4], since it
>>>> is a widely used component. Because storm-contrib wasn't being
>>>> tagged in github, a lot of users had trouble reconciling with
>>>> which versions of storm it was compatible. Some users built off
>>>> specific commit hashes, some forked, and a few even pushed
>>>> custom builds to repositories such as clojars. With kafka 0.8
>>>> now available, there are two main storm-kafka projects, the
>>>> original (compatible with kafka 0.7) and an updated fork [5]
>>>> (compatible with kafka 0.8).
>>>> 
>>>> My intention is not to find fault in any way, but rather to
>>>> point out the resulting pain, and work toward a better
>>>> solution.
>>>> 
>>>> I think it would be beneficial to the Storm user community to
>>>> have certain commonly used modules like storm-kafka brought
>>>> into the Apache Storm project. Another benefit worth
>>>> considering is the licensing/legal oversight that the ASF
>>>> provides, which is important to many users.
>>>> 
>>>> If this is something we want to do, then the big question
>>>> becomes what sort of governance process needs to be established to
>>>> ensure that such things are properly maintained.
>>>> 
>>>> Some random thoughts, questions, etc. that jump to mind
>>>> include:
>>>> 
>>>> What to call these things: "contrib modules", "connectors",
>>>> "integration modules", etc.?
>>>> Build integration: I imagine they would be a multi-module
>>>> submodule of the main maven build. Probably turned off by
>>>> default and enabled by a maven profile.
>>>> Governance: Have one or more committer volunteers responsible
>>>> for maintenance, merging patches, etc.? Proposal process for
>>>> pulling new modules?
>>>> 
>>>> 
>>>> I look forward to hearing others' opinions.
>>>> 
>>>> - Taylor
>>>> 
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>>> [2] https://github.com/nathanmarz/storm-contrib
>>>> [3] https://github.com/stormprocessor
>>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Thanks Michael and everyone else who participated in this discussion. It has been very constructive and raised some excellent points regarding not just the “contrib” module topic, but also how the project overall can be improved for both users and developers.

I don’t think it would be reasonable (or advisable) to try to tackle everything at once, so I think it best to work through it piece by piece, on a case-by-case basis. I have a git repo that has most of these modules incorporated (with full commit history) and integrated into the maven build. It wouldn’t be hard to create separate pull requests for each, so they can be discussed and merged (or not) independently.

It might just be me, but in my experience project “contrib” directories can tend to be somewhat of a wild west in terms of how well they are maintained. Since one of the goals here is to make sure everything added becomes a first-class citizen within the project, I’m leaning toward using a different name. What do others think? Any thoughts on a different name?

There seems to be consensus that storm-starter and storm-kafka be brought in, so I will start there. I’ll open a pull request to bring storm-starter into an “examples” directory.

storm-kafka will be somewhat complicated by the fact that storm-kafka-0.8-plus was forked from the original source without commit history. We’ll also have to figure out both whether and how to maintain compatibility with two versions of kafka. I’ll propose starting with the original storm-kafka, preserving commit history, and we can work from there. As I mentioned previously, the author of storm-kafka-0.8-plus is willing to help out.

While I agree that https://github.com/HolmesNL/kafka-spout is worthy of consideration, it’s a little more complicated from an IP clearance perspective. For that to be an option, I believe the Netherlands Forensic Institute (the entity owning the IP) would have to donate it to the ASF and go through a formal IP clearance process.

One final note regarding Michael’s “Have Storm up and running faster than you can brew an espresso” point: Personally I think vagrant [1] is awesome for this purpose and I use it heavily for testing Storm patches, releases, etc. A while back I made the project available on github [2]; I’ve just been somewhat neglectful of pushing branches and enhancements. But it’s awesome to be able to go from zero to a running storm cluster in a matter of minutes. I did something similar with Apache Whirr [3][4], but in my opinion one of the nice things about vagrant is that it (and VirtualBox) is free, and if you forget and leave your clusters running, your credit card won't get dinged. (N.B.: I’m not suggesting any of the mentioned projects necessarily get pulled in, but that something along those lines could be really helpful for new users.)

- Taylor

[1] http://www.vagrantup.com
[2] https://github.com/ptgoetz/storm-vagrant
[3] https://github.com/ptgoetz/whirr-storm
[4] https://github.com/ptgoetz/whirr-kafka


On Mar 1, 2014, at 5:11 AM, Michael G. Noll <mi...@michael-noll.com> wrote:

> Thanks for starting this discussion, Taylor.
> 
> As a user of Storm (and a small-scale contributor to storm-starter) as
> well as a user of Kafka, here are my $.02.
> 
> [Storm and Kafka]
> First, I agree with Nathan that storm-kafka should be considered to be
> brought in.  While various "integrate Storm with X" options exist,
> basically everyone I have been talking to is using Kafka in
> combination with Storm.  I'm sure this is not a representative sample
> of Storm users, and of course one may or may not agree that Kafka is
> important enough of a technology in Storm's ecosystem.  Still, I do
> see the need to make sure Storm and Kafka do work together without
> having to go through forks of forks on GitHub and spending days to
> figure out how to get data from Kafka (0.8) into Storm.
>    Speaking of Kafka spout implementations, please don't forget
> https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
> We've been quite happy with the former, so I'd suggest to at least
> consider both options here (maybe the two projects can even join forces?).
> 
> [Storm examples, storm-starter]
> Second, IMHO every open source project should have a "1-click starting
> experience" for new users.  That's very much related to the project
> principles of tools like LogStash [1] who say: "Community: If a newbie
> has a bad time, it's a bug."  For this reason I personally would like
> to see the equivalent of storm-starter being brought into the "core"
> Storm project -- think of an examples/ sub-module.  If the level of
> effort is deemed too high to e.g. maintain what's already in
> storm-starter, then (say) reduce the scope and remove some of the
> examples.  In any case I'd personally like to see bundled
> examples that are known to work with the latest version of Storm.
> storm-starter is often used to show new users how to get started with
> Storm (I used that approach in my Storm blog posts, for instance, and
> others like Mesosphere.io are even using storm-starter for their
> commercial offerings [2]).
> 
> [Have Storm up and running faster than you can brew an espresso]
> Third, for the same reason (get people up and running in a few
> minutes), I do like that other people in this thread have been
> bringing up projects like storm-deploy.  For the same reason I have
> open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
> few days ago, and I'll soon open source another Vagrant/Puppet based
> tool that provides you with 1-click local and remote deployments of
> Storm and Kafka clusters.  That's way better IMHO than having to
> follow long articles or blog posts to deploy your first cluster.  And
> there are a number of other people that have been rolling their own
> variants.  Now don't get me wrong -- I don't mention this to pitch any
> of those tools.  My intention is to say that it would be greatly
> helpful to have /something/ like this for Storm, for the same reason
> that it's nice to have LocalCluster for unit testing.  I have been
> demo'ing both Storm and Kafka by launching clusters with a simple
> command line, which always gets people excited.  If they can then rely
> on existing examples (see above) to also /run/ an analysis on "their"
> cluster then they have a beautiful start.
>    Oh, and btw:  Apache Aurora (with Mesos) have such a Vagrant-based
> VM cluster setup, too [4] so that people can run the Aurora tutorial
> on their machines in a few minutes.
> 
> [Storm and YARN]
> Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
> would be nice.  It ties into being able to run LocalCluster as well as
> to run Storm in local or remote VMs -- but now alongside your existing
> Hadoop/YARN infrastructure.  For those preferring Mesos Storm-on-Mesos
> will surely be similarly attractive.
> 
> 
> On a related note bringing the Storm docs up to speed with the quality
> of the Storm code would also be great.  I have seen that since Storm
> moved to Incubator several new sections have been added such as the
> FAQ [5] (btw: nice!).
> 
> Similarly, there should be better examples and docs for users how to
> write unit tests for Storm.  Right now people seem to be cobbling
> together their test code by figuring out how the 1-year old code in
> [6] actually works, and copy-pasting other people's test code from GitHub.
> 
> --
> 
> As I said above, these are my personal $.02.  I admit that my comments
> go a bit beyond the original question of bringing in contrib modules
> -- I think implicitly the discussion about the contrib modules also
> means "what do you need to provide a better and more well-rounded
> experience", i.e. the question whether to have batteries included or
> not. (As you may suspect I'm leaning towards including at least the
> most important batteries, though what's really "important" on the
> project level is of course up for debate.)
> 
> On my side I'd be happy to help with those areas where I am able to
> contribute, whether that's code and examples (like storm-starter) or
> tutorials/docs (I already wrote e.g. [7] and [8]).
> 
> Again, thanks Taylor for starting this discussion.  No matter the
> actual outcome I'm sure the state of the project will be improved.
> 
> Best,
> Michael
> 
> 
> 
> [1] https://github.com/elasticsearch/logstash
> [2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
> [3] https://github.com/miguno/puppet-storm
> [4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
> [5] http://storm.incubator.apache.org/documentation/FAQ.html
> [6]
> https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
> [7]
> https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
> [8]
> http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/
> 
> 
> 
> On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
>> Thanks for the feedback Bobby.
>> 
>> To clarify, I’m mainly talking about spout/bolt/trident state 
>> implementations that integrate storm with *Technology X*, where 
>> *Technology X* is not a fundamental part of storm.
>> 
>> Examples would be technologies that are part of or related to the 
>> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: 
>> Kafka, HDFS, HBase, Cassandra, etc.
>> 
>> The idea behind having one or more Storm committers act as a
>> “sponsor” is to make sure new additions are done carefully and with
>> good reason. To add a new module, it would require committer/PPMC
>> consensus, and assignment of one or more sponsors. Part of a
>> sponsor’s job would be to ensure that a module is maintained, which
>> would require enough familiarity with the code to support it long
>> term. If a new module was proposed, but no committers were willing
>> to act as a sponsor, it would not be added.
>> 
>> It would be the Committers’/PPMC’s responsibility to make sure things 
>> don’t get out of hand, and to do something about it if they do.
>> 
>> Here’s an old Hadoop JIRA thread [1] discussing the addition of
>> Hive as a contrib module, similar to what happened with HBase as
>> Bobby pointed out. Some interesting points are brought up. The
>> difference here is that both HBase and Hive were pretty big
>> codebases relative to Hadoop. With spout/bolt/state implementations
>> I doubt we’d see anything along that scale.
>> 
>> - Taylor
>> 
>> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>> 
>> 
>> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com 
>> <ma...@yahoo-inc.com>> wrote:
>> 
>>> I can see a lot of value in having a distribution of storm that
>>> comes with batteries included, everything is tested together and
>>> you know it works.  But I don’t see much long term developer
>>> benefit in building them all together.  If there is strong
>>> coupling between storm and these external projects so that they
>>> break when storm changes then we need to understand the coupling
>>> and decide if we want to reduce that coupling by stabilizing
>>> APIs, improving version numbering and release process, etc.; or
>>> if the functionality is something that should be offered as a
>>> base service in storm.
>>> 
>>> I can see politically the value of giving these other projects a
>>> home in Apache, and making them sub-projects is the simplest
>>> route to that. I’d love to have storm on yarn inside Apache.  I
>>> just don’t want to go overboard with it.  There was a time when
>>> HBase was a “contrib” module under Hadoop along with a lot of
>>> other things, and the Apache board came and told Hadoop to break
>>> it up.
>>> 
>>> Bringing storm-kafka into storm does not sound like it will solve
>>> much from a developer’s perspective, because there is at least as
>>> much coupling with kafka as there is with storm.  I can see how
>>> it is a huge amount of overhead and pain to set up a new project
>>> just for a few hundred lines of code, as such I am in favor of
>>> pulling in closely related projects, especially those that are
>>> spouts and state implementations. I just want to be sure that we
>>> do it carefully, with a good reason, and with enough people who
>>> are familiar with the code to support it long term.
>>> 
>>> If it starts to look like we are pulling in too many projects
>>> perhaps we should look at something more like the bigtop project 
>>> https://bigtop.apache.org/ which produces a tested distribution
>>> of Hadoop with many different sub-projects included in it.
>>> 
>>> I am also a bit concerned about these sub-projects becoming
>>> second class citizens, where we break something, but because the
>>> build is off by default we don’t know it.  I would prefer that
>>> they are built and tested by default.  If the build and test time
>>> starts to take too long, to me that means we need to start
>>> wondering if we have too many contrib modules.
>>> 
>>> —Bobby
>>> 
>>> From: Brian Enochson <brian.enochson@gmail.com 
>>> <ma...@gmail.com>>
>>> Reply-To: "user@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>"
>>> <user@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>>
>>> Date: Tuesday, February 25, 2014 at 9:50 PM
>>> To: "user@storm.incubator.apache.org 
>>> <ma...@storm.incubator.apache.org>"
>>> <user@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>>
>>> Cc: "dev@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>"
>>> <dev@storm.incubator.apache.org
>>> <ma...@storm.incubator.apache.org>>
>>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>>> 
>>> hi, I am in agreement with Taylor and believe I understand his
>>> intent. An incredible tool/framework/application like Storm is
>>> only enhanced and gains value from the number of well maintained
>>> and vetted modules that can be used for integration and adding
>>> further functionality. I am relatively new to the Storm community
>>> but have spent quite some time reviewing contributing modules out
>>> there, reviewing various duplicates and running into some version
>>> incompatibilities. I understand the need to keep Storm itself
>>> pure, but do think there needs to be some structure and
>>> governance added to the contributing modules. Look at the benefit
>>> a tool like npm brings to the node community. I like the idea of
>>> sponsorship, vetting and a community vote.  I, as I'm sure many
>>> would be, am willing to offer support and time to working through
>>> how to set this up and helping with the implementation if it is
>>> decided to pursue some solution. I hope these views are taken in
>>> the spirit they are made, to make this incredible system even
>>> better along with the surrounding eco-system.
>>> 
>>> Thanks, Brian
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>>> <ptgoetz@gmail.com
>>> <ma...@gmail.com>> wrote:
>>>
>>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
>>> suggesting that whatever a “contrib” project/module/subproject
>>> might become, be a clearinghouse for anything Storm-related.
>>> 
>>> I see it as something that is well-vetted by the Storm
>>> community, subject to PPMC review, vote, etc. Entry would require
>>> community review, PPMC review, and in some cases ASF IP
>>> clearance/legal review. Anything added would require some level
>>> of commitment from the PPMC/committers to provide some level of
>>> support.
>>> 
>>> In other words, nothing “willy-nilly”.
>>> 
>>> One option could be that any module added require (X > 0)  number
>>> of committers to volunteer as “sponsor”s for the module, and
>>> commit to maintaining it.
>>> 
>>> That being said, I don’t see storm-kafka being any different
>>> from anything else that provides integration points for Storm.
>>> 
>>> -Taylor
>>> 
>>> 
>>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com 
>>> <ma...@nathanmarz.com>>
>>> wrote:
>>> 
>>> I'm only +1 for pulling in storm-kafka and updating it. Other
>>> projects put these contrib modules in a "contrib" folder and keep
>>> them managed as completely separate codebases. As it's not
>>> actually a "module" necessary for Storm, there's an argument
>>> there for doing it that way rather than via the multi-module
>>> route.
>>> 
>>> 
>>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>>> <mpathira@umail.iu.edu
>>> <ma...@umail.iu.edu>>
>>> wrote:
>>>
>>> Hi Taylor,
>>>
>>> I'm +1 for pulling these external libraries into the Apache
>>> codebase. This will certainly benefit the Storm community. I'd
>>> also like to contribute to this process.
>>>
>>> Thanks,
>>> Milinda
>>> 
>>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>>> <ptgoetz@gmail.com 
>>> <ma...@gmail.com>> wrote:
>>>> A while back I opened STORM-206 [1] to capture ideas for
>>>> pulling in "contrib" modules to the Apache codebase.
>>>> 
>>>> In the past, we had the storm-contrib github project [2] which 
>>>> subsequently got broken up into individual projects hosted on
>>>> the stormprocessor github group [3] and elsewhere.
>>>> 
>>>> The problem with this approach is that in certain cases it led
>>>> to code rot (modules not being updated in step with Storm's
>>>> API), fragmentation (multiple similar modules with the same
>>>> name), and confusion.
>>>> 
>>>> A good example of this is the storm-kafka module [4], since it
>>>> is a widely used component. Because storm-contrib wasn't being
>>>> tagged in github, a lot of users had trouble reconciling with
>>>> which versions of storm it was compatible. Some users built off
>>>> specific commit hashes, some forked, and a few even pushed
>>>> custom builds to repositories such as clojars. With kafka 0.8
>>>> now available, there are two main storm-kafka projects, the
>>>> original (compatible with kafka 0.7) and an updated fork [5]
>>>> (compatible with kafka 0.8).
>>>> 
>>>> My intention is not to find fault in any way, but rather to
>>>> point out the resulting pain, and work toward a better
>>>> solution.
>>>> 
>>>> I think it would be beneficial to the Storm user community to
>>>> have certain commonly used modules like storm-kafka brought
>>>> into the Apache Storm project. Another benefit worth
>>>> considering is the licensing/legal oversight that the ASF
>>>> provides, which is important to many users.
>>>> 
>>>> If this is something we want to do, then the big question
>>>> becomes what sort of governance process needs to be established
>>>> to ensure that such things are properly maintained.
>>>> 
>>>> Some random thoughts, questions, etc. that jump to mind
>>>> include:
>>>> 
>>>> What to call these things: "contrib modules", "connectors",
>>>> "integration modules", etc.?
>>>>
>>>> Build integration: I imagine they would be a multi-module
>>>> submodule of the main maven build. Probably turned off by
>>>> default and enabled by a maven profile.
>>>>
>>>> Governance: Have one or more committer volunteers responsible
>>>> for maintenance, merging patches, etc.? Proposal process for
>>>> pulling new modules?
>>>> 
>>>> 
>>>> I look forward to hearing others' opinions.
>>>> 
>>>> - Taylor
>>>> 
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>>> [2] https://github.com/nathanmarz/storm-contrib
>>>> [3] https://github.com/stormprocessor
>>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "Michael G. Noll" <mi...@michael-noll.com>.

Thanks for starting this discussion, Taylor.

As a user of Storm (and a small-scale contributor to storm-starter) as
well as a user of Kafka, here are my $.02.

[Storm and Kafka]
First, I agree with Nathan that storm-kafka should be considered to be
brought in.  While various "integrate Storm with X" options exist,
basically everyone I have been talking to is using Kafka in
combination with Storm.  I'm sure this is not a representative sample
of Storm users, and of course one may or may not agree that Kafka is
important enough of a technology in Storm's ecosystem.  Still, I do
see the need to make sure Storm and Kafka do work together without
having to go through forks of forks on GitHub and spending days to
figure out how to get data from Kafka (0.8) into Storm.
    Speaking of Kafka spout implementations, please don't forget
https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
We've been quite happy with the former, so I'd suggest at least
considering both options here (maybe the two projects can even join forces?).

[Storm examples, storm-starter]
Second, IMHO every open source project should have a "1-click starting
experience" for new users.  That's very much related to the project
principles of tools like LogStash [1], which say: "Community: If a newbie
has a bad time, it's a bug."  For this reason I personally would like
to see the equivalent of storm-starter being brought into the "core"
Storm project -- think of an examples/ sub-module.  If the level of
effort is deemed too high to e.g. maintain what's already in
storm-starter, then (say) reduce the scope and remove some of the
examples.  In any case I would personally like to see bundled
examples that are known to work with the latest version of Storm.
storm-starter is often used to show new users how to get started with
Storm (I used that approach in my Storm blog posts, for instance, and
others like Mesosphere.io are even using storm-starter for their
commercial offerings [2]).

[Have Storm up and running faster than you can brew an espresso]
Third, for the same reason (get people up and running in a few
minutes), I do like that other people in this thread have been
bringing up projects like storm-deploy.  For the same reason I have
open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
few days ago, and I'll soon open source another Vagrant/Puppet based
tool that provides you with 1-click local and remote deployments of
Storm and Kafka clusters.  That's way better IMHO than having to
follow long articles or blog posts to deploy your first cluster.  And
there are a number of other people that have been rolling their own
variants.  Now don't get me wrong -- I don't mention this to pitch any
of those tools.  My intention is to say that it would be greatly
helpful to have /something/ like this for Storm, for the same reason
that it's nice to have LocalCluster for unit testing.  I have been
demo'ing both Storm and Kafka by launching clusters with a simple
command line, which always gets people excited.  If they can then rely
on existing examples (see above) to also /run/ an analysis on "their"
cluster then they have a beautiful start.
    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
VM cluster setup, too [4] so that people can run the Aurora tutorial
on their machines in a few minutes.

[Storm and YARN]
Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
would be nice.  It ties into being able to run LocalCluster as well as
to run Storm in local or remote VMs -- but now alongside your existing
Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
will surely be similarly attractive.


On a related note bringing the Storm docs up to speed with the quality
of the Storm code would also be great.  I have seen that since Storm
moved to Incubator several new sections have been added such as the
FAQ [5] (btw: nice!).

Similarly, there should be better examples and docs for users on how to
write unit tests for Storm.  Right now people seem to be cobbling
together their test code by figuring out how the 1-year old code in
[6] actually works, and copy-pasting other people's test code from GitHub.

--

As I said above, these are my personal $.02.  I admit that my comments
go a bit beyond the original question of bringing in contrib modules
-- I think implicitly the discussion about the contrib modules also
means "what do you need to provide a better and more well-rounded
experience", i.e. the question whether to have batteries included or
not. (As you may suspect I'm leaning towards including at least the
most important batteries, though what's really "important" at the
project level is of course up for debate.)

On my side I'd be happy to help with those areas where I am able to
contribute, whether that's code and examples (like storm-starter) or
tutorials/docs (I already wrote e.g. [7] and [8]).

Again, thanks Taylor for starting this discussion.  No matter the
actual outcome I'm sure the state of the project will be improved.

Best,
Michael



[1] https://github.com/elasticsearch/logstash
[2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
[3] https://github.com/miguno/puppet-storm
[4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
[5] http://storm.incubator.apache.org/documentation/FAQ.html
[6]
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
[7]
https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
[8]
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/



On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> Thanks for the feedback Bobby.
> 
> To clarify, I’m mainly talking about spout/bolt/trident state 
> implementations that integrate storm with *Technology X*, where 
> *Technology X* is not a fundamental part of storm.
> 
> Examples would be technologies that are part of or related to the
> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> Kafka, HDFS, HBase, Cassandra, etc.
> 
> The idea behind having one or more Storm committers act as a
> “sponsor” is to make sure new additions are done carefully and with
> good reason. To add a new module, it would require committer/PPMC
> consensus, and assignment of one or more sponsors. Part of a
> sponsor’s job would be to ensure that a module is maintained, which
> would require enough familiarity with the code to support it long
> term. If a new module was proposed, but no committers were willing
> to act as a sponsor, it would not be added.
> 
> It would be the Committers’/PPMC’s responsibility to make sure things
> didn’t get out of hand, and to do something about it if they did.
> 
> Here’s an old Hadoop JIRA thread [1] discussing the addition of
> Hive as a contrib module, similar to what happened with HBase as
> Bobby pointed out. Some interesting points are brought up. The
> difference here is that both HBase and Hive were pretty big
> codebases relative to Hadoop. With spout/bolt/state implementations
> I doubt we’d see anything along that scale.
> 
> - Taylor
> 
> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> 
> 
> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com 
> <ma...@yahoo-inc.com>> wrote:
> 
>> I can see a lot of value in having a distribution of storm that
>> comes with batteries included, everything is tested together and
>> you know it works.  But I don’t see much long term developer
>> benefit in building them all together.  If there is strong
>> coupling between storm and these external projects so that they
>> break when storm changes then we need to understand the coupling
>> and decide if we want to reduce that coupling by stabilizing
>> APIs, improving version numbering and release process, etc.; or
>> if the functionality is something that should be offered as a
>> base service in storm.
>> 
>> I can see politically the value of giving these other projects a
>> home in Apache, and making them sub-projects is the simplest
>> route to that. I’d love to have storm on yarn inside Apache.  I
>> just don’t want to go overboard with it.  There was a time when
>> HBase was a “contrib” module under Hadoop along with a lot of
>> other things, and the Apache board came and told Hadoop to break
>> it up.
>> 
>> Bringing storm-kafka into storm does not sound like it will solve
>> much from a developer’s perspective, because there is at least as
>> much coupling with kafka as there is with storm.  I can see how
>> it is a huge amount of overhead and pain to set up a new project
>> just for a few hundred lines of code, as such I am in favor of
>> pulling in closely related projects, especially those that are
>> spouts and state implementations. I just want to be sure that we
>> do it carefully, with a good reason, and with enough people who
>> are familiar with the code to support it long term.
>> 
>> If it starts to look like we are pulling in too many projects
>> perhaps we should look at something more like the bigtop project 
>> https://bigtop.apache.org/ which produces a tested distribution
>> of Hadoop with many different sub-projects included in it.
>> 
>> I am also a bit concerned about these sub-projects becoming
>> second class citizens, where we break something, but because the
>> build is off by default we don’t know it.  I would prefer that
>> they are built and tested by default.  If the build and test time
>> starts to take too long, to me that means we need to start
>> wondering if we have too many contrib modules.
>> 
>> —Bobby
>> 
>> From: Brian Enochson <brian.enochson@gmail.com
>> <ma...@gmail.com>>
>> Reply-To: "user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>"
>> <user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>>
>> Date: Tuesday, February 25, 2014 at 9:50 PM
>> To: "user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>"
>> <user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>>
>> Cc: "dev@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>"
>> <dev@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>>
>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> 
>> hi, I am in agreement with Taylor and believe I understand his
>> intent. An incredible tool/framework/application like Storm is
>> only enhanced and gains value from the number of well maintained
>> and vetted modules that can be used for integration and adding
>> further functionality. I am relatively new to the Storm community
>> but have spent quite some time reviewing contributing modules out
>> there, reviewing various duplicates and running into some version
>> incompatibilities. I understand the need to keep Storm itself
>> pure, but do think there needs to be some structure and
>> governance added to the contributing modules. Look at the benefit
>> a tool like npm brings to the node community. I like the idea of
>> sponsorship, vetting and a community vote.  I, as I'm sure many
>> would be, am willing to offer support and time to working through
>> how to set this up and helping with the implementation if it is
>> decided to pursue some solution. I hope these views are taken in
>> the spirit they are made, to make this incredible system even
>> better along with the surrounding eco-system.
>> 
>> Thanks, Brian
>> 
>> 
>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>> <ptgoetz@gmail.com
>> <ma...@gmail.com>> wrote:
>>
>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
>> suggesting that whatever a “contrib” project/module/subproject
>> might become, be a clearinghouse for anything Storm-related.
>> 
>> I see it as something that is well-vetted by the Storm
>> community, subject to PPMC review, vote, etc. Entry would require
>> community review, PPMC review, and in some cases ASF IP
>> clearance/legal review. Anything added would require some level
>> of commitment from the PPMC/committers to provide some level of
>> support.
>> 
>> In other words, nothing “willy-nilly”.
>> 
>> One option could be that any module added require (X > 0)  number
>> of committers to volunteer as “sponsor”s for the module, and
>> commit to maintaining it.
>> 
>> That being said, I don’t see storm-kafka being any different
>> from anything else that provides integration points for Storm.
>> 
>> -Taylor
>> 
>> 
>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com 
>> <ma...@nathanmarz.com>>
>> wrote:
>> 
>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> projects put these contrib modules in a "contrib" folder and keep
>> them managed as completely separate codebases. As it's not
>> actually a "module" necessary for Storm, there's an argument
>> there for doing it that way rather than via the multi-module
>> route.
>> 
>> 
>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>> <mpathira@umail.iu.edu
>> <ma...@umail.iu.edu>>
>> wrote:
>>
>> Hi Taylor,
>>
>> I'm +1 for pulling these external libraries into the Apache
>> codebase. This will certainly benefit the Storm community. I'd
>> also like to contribute to this process.
>>
>> Thanks,
>> Milinda
>> 
>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>> <ptgoetz@gmail.com 
>> <ma...@gmail.com>> wrote:
>>> A while back I opened STORM-206 [1] to capture ideas for
>>> pulling in "contrib" modules to the Apache codebase.
>>> 
>>> In the past, we had the storm-contrib github project [2] which 
>>> subsequently got broken up into individual projects hosted on
>>> the stormprocessor github group [3] and elsewhere.
>>> 
>>> The problem with this approach is that in certain cases it led
>>> to code rot (modules not being updated in step with Storm's
>>> API), fragmentation (multiple similar modules with the same
>>> name), and confusion.
>>> 
>>> A good example of this is the storm-kafka module [4], since it
>>> is a widely used component. Because storm-contrib wasn't being
>>> tagged in github, a lot of users had trouble reconciling with
>>> which versions of storm it was compatible. Some users built off
>>> specific commit hashes, some forked, and a few even pushed
>>> custom builds to repositories such as clojars. With kafka 0.8
>>> now available, there are two main storm-kafka projects, the
>>> original (compatible with kafka 0.7) and an updated fork [5]
>>> (compatible with kafka 0.8).
>>> 
>>> My intention is not to find fault in any way, but rather to
>>> point out the resulting pain, and work toward a better
>>> solution.
>>> 
>>> I think it would be beneficial to the Storm user community to
>>> have certain commonly used modules like storm-kafka brought
>>> into the Apache Storm project. Another benefit worth
>>> considering is the licensing/legal oversight that the ASF
>>> provides, which is important to many users.
>>> 
>>> If this is something we want to do, then the big question
>>> becomes what sort of governance process needs to be established
>>> to ensure that such things are properly maintained.
>>> 
>>> Some random thoughts, questions, etc. that jump to mind
>>> include:
>>> 
>>> What to call these things: "contrib modules", "connectors",
>>> "integration modules", etc.?
>>>
>>> Build integration: I imagine they would be a multi-module
>>> submodule of the main maven build. Probably turned off by
>>> default and enabled by a maven profile.
>>>
>>> Governance: Have one or more committer volunteers responsible
>>> for maintenance, merging patches, etc.? Proposal process for
>>> pulling new modules?
>>> 
>>> 
>>> I look forward to hearing others' opinions.
>>> 
>>> - Taylor
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>> [2] https://github.com/nathanmarz/storm-contrib
>>> [3] https://github.com/stormprocessor
>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "Michael G. Noll" <mi...@michael-noll.com>.

Thanks for starting this discussion, Taylor.

As a user of Storm (and a small-scale contributor to storm-starter) as
well as a user of Kafka, here are my $.02.

[Storm and Kafka]
First, I agree with Nathan that storm-kafka should be considered to be
brought in.  While various "integrate Storm with X" options exist,
basically everyone I have been talking to is using Kafka in
combination with Storm.  I'm sure this is not a representative sample
of Storm users, and of course one may or may not agree that Kafka is
important enough of a technology in Storm's ecosystem.  Still, I do
see the need to make sure Storm and Kafka do work together without
having to go through forks of forks on GitHub and spending days to
figure out how to get data from Kafka (0.8) into Storm.
    Speaking of Kafka spout implementations, please don't forget
https://github.com/HolmesNL/kafka-spout in addition to Wurstmeister's.
We've been quite happy with the former, so I'd suggest at least
considering both options here (maybe the two projects can even join forces?).

[Storm examples, storm-starter]
Second, IMHO every open source project should have a "1-click starting
experience" for new users.  That's very much related to the project
principles of tools like LogStash [1], which say: "Community: If a newbie
has a bad time, it's a bug."  For this reason I personally would like
to see the equivalent of storm-starter being brought into the "core"
Storm project -- think of an examples/ sub-module.  If the level of
effort is deemed too high to e.g. maintain what's already in
storm-starter, then (say) reduce the scope and remove some of the
examples.  In any case I would personally like to see bundled
examples that are known to work with the latest version of Storm.
storm-starter is often used to show new users how to get started with
Storm (I used that approach in my Storm blog posts, for instance, and
others like Mesosphere.io are even using storm-starter for their
commercial offerings [2]).

[Have Storm up and running faster than you can brew an espresso]
Third, for the same reason (get people up and running in a few
minutes), I do like that other people in this thread have been
bringing up projects like storm-deploy.  For the same reason I have
open sourced puppet-storm [3] (and puppet-kafka, for that matter) a
few days ago, and I'll soon open source another Vagrant/Puppet based
tool that provides you with 1-click local and remote deployments of
Storm and Kafka clusters.  That's way better IMHO than having to
follow long articles or blog posts to deploy your first cluster.  And
there are a number of other people that have been rolling their own
variants.  Now don't get me wrong -- I don't mention this to pitch any
of those tools.  My intention is to say that it would be greatly
helpful to have /something/ like this for Storm, for the same reason
that it's nice to have LocalCluster for unit testing.  I have been
demo'ing both Storm and Kafka by launching clusters with a simple
command line, which always gets people excited.  If they can then rely
on existing examples (see above) to also /run/ an analysis on "their"
cluster then they have a beautiful start.
    Oh, and btw:  Apache Aurora (with Mesos) has such a Vagrant-based
VM cluster setup, too [4] so that people can run the Aurora tutorial
on their machines in a few minutes.

[Storm and YARN]
Fourth, and for similar reasons as #2 and #3, bringing in storm-yarn
would be nice.  It ties into being able to run LocalCluster as well as
to run Storm in local or remote VMs -- but now alongside your existing
Hadoop/YARN infrastructure.  For those preferring Mesos, Storm-on-Mesos
will surely be similarly attractive.


On a related note bringing the Storm docs up to speed with the quality
of the Storm code would also be great.  I have seen that since Storm
moved to Incubator several new sections have been added such as the
FAQ [5] (btw: nice!).

Similarly, there should be better examples and docs for users on how to
write unit tests for Storm.  Right now people seem to be cobbling
together their test code by figuring out how the 1-year old code in
[6] actually works, and copy-pasting other people's test code from GitHub.

--

As I said above, these are my personal $.02.  I admit that my comments
go a bit beyond the original question of bringing in contrib modules
-- I think implicitly the discussion about the contrib modules also
means "what do you need to provide a better and more well-rounded
experience", i.e. the question whether to have batteries included or
not. (As you may suspect I'm leaning towards including at least the
most important batteries, though what's really "important" at the
project level is of course up for debate.)

On my side I'd be happy to help with those areas where I am able to
contribute, whether that's code and examples (like storm-starter) or
tutorials/docs (I already wrote e.g. [7] and [8]).

Again, thanks Taylor for starting this discussion.  No matter the
actual outcome I'm sure the state of the project will be improved.

Best,
Michael



[1] https://github.com/elasticsearch/logstash
[2] http://mesosphere.io/learn/run-storm-on-mesos/#step-7
[3] https://github.com/miguno/puppet-storm
[4] https://github.com/apache/incubator-aurora/blob/master/docs/vagrant.md
[5] http://storm.incubator.apache.org/documentation/FAQ.html
[6]
https://github.com/xumingming/storm-lib/blob/master/src/jvm/storm/TestingApiDemo.java
[7]
https://github.com/nathanmarz/storm/wiki/Understanding-the-parallelism-of-a-Storm-topology
[8]
http://www.michael-noll.com/blog/2013/01/18/implementing-real-time-trending-topics-in-storm/



On 02/26/2014 08:21 PM, P. Taylor Goetz wrote:
> Thanks for the feedback Bobby.
> 
> To clarify, I’m mainly talking about spout/bolt/trident state 
> implementations that integrate storm with *Technology X*, where 
> *Technology X* is not a fundamental part of storm.
> 
> Examples would be technologies that are part of or related to the
> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
> Kafka, HDFS, HBase, Cassandra, etc.
> 
> The idea behind having one or more Storm committers act as a
> “sponsor” is to make sure new additions are done carefully and with
> good reason. To add a new module, it would require committer/PPMC
> consensus, and assignment of one or more sponsors. Part of a
> sponsor’s job would be to ensure that a module is maintained, which
> would require enough familiarity with the code to support it long
> term. If a new module was proposed, but no committers were willing
> to act as a sponsor, it would not be added.
> 
> It would be the Committers’/PPMC’s responsibility to make sure things
> didn’t get out of hand, and to do something about it if they did.
> 
> Here’s an old Hadoop JIRA thread [1] discussing the addition of
> Hive as a contrib module, similar to what happened with HBase as
> Bobby pointed out. Some interesting points are brought up. The
> difference here is that both HBase and Hive were pretty big
> codebases relative to Hadoop. With spout/bolt/state implementations
> I doubt we’d see anything along that scale.
> 
> - Taylor
> 
> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> 
> 
> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com 
> <ma...@yahoo-inc.com>> wrote:
> 
>> I can see a lot of value in having a distribution of storm that
>> comes with batteries included, everything is tested together and
>> you know it works.  But I don’t see much long term developer
>> benefit in building them all together.  If there is strong
>> coupling between storm and these external projects so that they
>> break when storm changes then we need to understand the coupling
>> and decide if we want to reduce that coupling by stabilizing
>> APIs, improving version numbering and release process, etc.; or
>> if the functionality is something that should be offered as a
>> base service in storm.
>> 
>> I can see politically the value of giving these other projects a
>> home in Apache, and making them sub-projects is the simplest
>> route to that. I’d love to have storm on yarn inside Apache.  I
>> just don’t want to go overboard with it.  There was a time when
>> HBase was a “contrib” module under Hadoop along with a lot of
>> other things, and the Apache board came and told Hadoop to break
>> it up.
>> 
>> Bringing storm-kafka into storm does not sound like it will solve
>> much from a developer’s perspective, because there is at least as
>> much coupling with kafka as there is with storm.  I can see how
>> it is a huge amount of overhead and pain to set up a new project
>> just for a few hundred lines of code, as such I am in favor of
>> pulling in closely related projects, especially those that are
>> spouts and state implementations. I just want to be sure that we
>> do it carefully, with a good reason, and with enough people who
>> are familiar with the code to support it long term.
>> 
>> If it starts to look like we are pulling in too many projects
>> perhaps we should look at something more like the bigtop project 
>> https://bigtop.apache.org/ which produces a tested distribution
>> of Hadoop with many different sub-projects included in it.
>> 
>> I am also a bit concerned about these sub-projects becoming
>> second class citizens, where we break something, but because the
>> build is off by default we don’t know it.  I would prefer that
>> they are built and tested by default.  If the build and test time
>> starts to take too long, to me that means we need to start
>> wondering if we have too many contrib modules.
>> 
>> —Bobby
>> 
>> From: Brian Enochson <brian.enochson@gmail.com
>> <ma...@gmail.com>>
>> Reply-To: "user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>"
>> <user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>>
>> Date: Tuesday, February 25, 2014 at 9:50 PM
>> To: "user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>"
>> <user@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>>
>> Cc: "dev@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>"
>> <dev@storm.incubator.apache.org
>> <ma...@storm.incubator.apache.org>>
>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>> 
>> hi, I am in agreement with Taylor and believe I understand his
>> intent. An incredible tool/framework/application like Storm is
>> only enhanced and gains value from the number of well maintained
>> and vetted modules that can be used for integration and adding
>> further functionality. I am relatively new to the Storm community
>> but have spent quite some time reviewing contributing modules out
>> there, reviewing various duplicates and running into some version
>> incompatibilities. I understand the need to keep Storm itself
>> pure, but do think there needs to be some structure and
>> governance added to the contributing modules. Look at the benefit
>> a tool like npm brings to the node community. I like the idea of
>> sponsorship, vetting and a community vote.  I, as I'm sure many would
>> be, am willing to offer support and time to working through how
>> to set this up and helping with the implementation if it is
>> decided to pursue some solution. I hope these views are taken in
>> the spirit they are made, to make this incredible system even
>> better along with the surrounding eco-system.
>> 
>> Thanks, Brian
>> 
>> 
>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
>> <ptgoetz@gmail.com> wrote:
>> 
>> Just to be clear (and play a little Devil’s advocate :) ), I’m not
>> suggesting that whatever a “contrib” project/module/subproject
>> might become, be a clearinghouse for anything Storm-related.
>> 
>> I see it as something that is well-vetted by the Storm
>> community, subject to PPMC review, vote, etc. Entry would require
>> community review, PPMC review, and in some cases ASF IP
>> clearance/legal review. Anything added would require some level
>> of commitment from the PPMC/committers to provide some level of
>> support.
>> 
>> In other words, nothing “willy-nilly”.
>> 
>> One option could be that any module added require (X > 0)  number
>> of committers to volunteer as “sponsor”s for the module, and
>> commit to maintaining it.
>> 
>> That being said, I don’t see storm-kafka being any different
>> from anything else that provides integration points for Storm.
>> 
>> -Taylor
>> 
>> 
>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com>
>> wrote:
>> 
>> I'm only +1 for pulling in storm-kafka and updating it. Other
>> projects put these contrib modules in a "contrib" folder and keep
>> them managed as completely separate codebases. As it's not
>> actually a "module" necessary for Storm, there's an argument
>> there for doing it that way rather than via the multi-module
>> route.
>> 
>> 
>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>> <mpathira@umail.iu.edu> wrote:
>> 
>> Hi Taylor,
>> 
>> I'm +1 for pulling these external libraries into Apache codebase.
>> This will certainly benefit the Storm community. I would also like to
>> contribute to this process.
>> 
>> Thanks Milinda
>> 
>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
>> <ptgoetz@gmail.com> wrote:
>>> A while back I opened STORM-206 [1] to capture ideas for
>>> pulling in "contrib" modules to the Apache codebase.
>>> 
>>> In the past, we had the storm-contrib github project [2] which 
>>> subsequently got broken up into individual projects hosted on
>>> the stormprocessor github group [3] and elsewhere.
>>> 
>>> The problem with this approach is that in certain cases it led
>>> to code rot (modules not being updated in step with Storm's
>>> API), fragmentation (multiple similar modules with the same
>>> name), and confusion.
>>> 
>>> A good example of this is the storm-kafka module [4], since it
>>> is a widely used component. Because storm-contrib wasn't being
>>> tagged in github, a lot of users had trouble reconciling with
>>> which versions of storm it was compatible. Some users built off
>>> specific commit hashes, some forked, and a few even pushed
>>> custom builds to repositories such as clojars. With kafka 0.8
>>> now available, there are two main storm-kafka projects, the
>>> original (compatible with kafka 0.7) and an updated fork [5]
>>> (compatible with kafka 0.8).
>>> 
>>> My intention is not to find fault in any way, but rather to
>>> point out the resulting pain, and work toward a better
>>> solution.
>>> 
>>> I think it would be beneficial to the Storm user community to
>>> have certain commonly used modules like storm-kafka brought
>>> into the Apache Storm project. Another benefit worth
>>> considering is the licensing/legal oversight that the ASF
>>> provides, which is important to many users.
>>> 
>>> If this is something we want to do, then the big question
>>> becomes what sort of governance process needs to be established to
>>> ensure that such things are properly maintained.
>>> 
>>> Some random thoughts, questions, etc. that jump to mind
>>> include:
>>> 
>>> What to call these things: "contrib modules", "connectors",
>>> "integration modules", etc.?
>>> Build integration: I imagine they would be a multi-module
>>> submodule of the main maven build. Probably turned off by default
>>> and enabled by a maven profile.
>>> Governance: Have one or more committer volunteers responsible
>>> for maintenance, merging patches, etc.? Proposal process for
>>> pulling new modules?
>>> 
>>> 
>>> I look forward to hearing others' opinions.
>>> 
>>> - Taylor
>>> 
>>> 
>>> [1] https://issues.apache.org/jira/browse/STORM-206
>>> [2] https://github.com/nathanmarz/storm-contrib
>>> [3] https://github.com/stormprocessor
>>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus

RE: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "Huang, Roger" <ro...@visa.com>.

Bobby,
I vote to include both storm-yarn and storm-deploy.
Roger


-----Original Message-----
From: Brian O'Neill [mailto:boneill42@gmail.com] On Behalf Of Brian O'Neill
Sent: Wednesday, February 26, 2014 3:39 PM
To: dev@storm.incubator.apache.org
Cc: user@storm.incubator.apache.org
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache


Bobby,

FWIW, I'd love to see storm-yarn inside.  I think we could definitely make things easier on the end-user if they were more cohesive.

e.g. Imagine if we had "storm launch yarn" inside of $storm/bin that would kickoff a storm-yarn launch, with whatever version was built.  It would likely simplify the "create-tarball" and storm-yarn getStormConfig process as well.

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive | King of Prussia, PA | 19406
M: 215.588.6024 | @boneill42 <http://www.twitter.com/boneill42> | healthmarketscience.com

This information transmitted in this email message is for the intended recipient only and may contain confidential and/or privileged material. If you received this email in error and are not the intended recipient, or the person responsible to deliver it to the intended recipient, please contact the sender at the email above and delete this email and any attachments and destroy any copies thereof. Any review, retransmission, dissemination, copying or other use of, or taking any action in reliance upon, this information by persons or entities other than the intended recipient is strictly prohibited.
 






On 2/26/14, 4:25 PM, "Bobby Evans" <ev...@yahoo-inc.com> wrote:

>I totally agree and I am +1 on bringing these spout/trident pieces in, 
>assuming there are committers to support them.
>
>I am also curious about how people feel about pulling in other projects 
>like storm-starter, storm-deploy, storm-mesos, and storm-yarn?
>
>Storm-starter in my opinion seems more like documentation and it would 
>be nice to pull in so that it stays up to date with storm itself, just 
>like the documentation.
>
>The others are more of ways to run storm in different environments.  
>They seem like there could be a lot of coupling between them and storm 
>as storm evolves, and they kind of fit with "integrate storm with 
>*Technology X*" except X in this case is a compute environment instead 
>of a data source or store. But then again we also just shot down a 
>request to create juju charms for storm.
>
>-Bobby
>
>From: "P. Taylor Goetz" <pt...@gmail.com>
>Reply-To: <de...@storm.incubator.apache.org>
>Date: Wednesday, February 26, 2014 at 1:21 PM
>To: <de...@storm.incubator.apache.org>
>Cc: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
>Thanks for the feedback Bobby.
>
>To clarify, I'm mainly talking about spout/bolt/trident state
>implementations that integrate storm with *Technology X*, where
>*Technology X* is not a fundamental part of storm.
>
>Examples would be technologies that are part of or related to the
>Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.:
>Kafka, HDFS, HBase, Cassandra, etc.
>
>The idea behind having one or more Storm committers act as a "sponsor"
>is to make sure new additions are done carefully and with good reason.
>To add a new module, it would require committer/PPMC consensus, and
>assignment of one or more sponsors. Part of a sponsor's job would be to
>ensure that a module is maintained, which would require enough
>familiarity with the code to support it long term. If a new module was
>proposed, but no committers were willing to act as a sponsor, it would
>not be added.
>
>It would be the Committers'/PPMC's responsibility to make sure things
>don't get out of hand, and to do something about it if they do.
>
>Here's an old Hadoop JIRA thread [1] discussing the addition of Hive as
>a contrib module, similar to what happened with HBase as Bobby pointed out.
>Some interesting points are brought up. The difference here is that
>both HBase and Hive were pretty big codebases relative to Hadoop. With
>spout/bolt/state implementations I doubt we'd see anything along that
>scale.
>
>- Taylor
>
>[1] https://issues.apache.org/jira/browse/HADOOP-3601
>
>
>On Feb 26, 2014, at 12:35 PM, Bobby Evans 
><ev...@yahoo-inc.com> wrote:
>
>I can see a lot of value in having a distribution of storm that comes 
>with batteries included, everything is tested together and you know it 
>works.  But I don't see much long term developer benefit in building 
>them all together.  If there is strong coupling between storm and these 
>external projects so that they break when storm changes then we need to 
>understand the coupling and decide if we want to reduce that coupling 
>by stabilizing APIs, improving version numbering and release process, 
>etc.; or if the functionality is something that should be offered as a 
>base service in storm.
>
>I can see politically the value of giving these other projects a home 
>in Apache, and making them sub-projects is the simplest route to that.  
>I'd love to have storm on yarn inside Apache.  I just don't want to go 
>overboard with it.  There was a time when HBase was a "contrib" module 
>under Hadoop along with a lot of other things, and the Apache board 
>came and told Hadoop to break it up.
>
>Bringing storm-kafka into storm does not sound like it will solve much 
>from a developer's perspective, because there is at least as much 
>coupling with kafka as there is with storm.  I can see how it is a huge 
>amount of overhead and pain to set up a new project just for a few 
>hundred lines of code, as such I am in favor of pulling in closely 
>related projects, especially those that are spouts and state 
>implementations. I just want to be sure that we do it carefully, with a 
>good reason, and with enough people who are familiar with the code to 
>support it long term.
>
>If it starts to look like we are pulling in too many projects perhaps 
>we should look at something more like the bigtop project 
>https://bigtop.apache.org/ which produces a tested distribution of 
>Hadoop with many different sub-projects included in it.
>
>I am also a bit concerned about these sub-projects becoming second 
>class citizens, where we break something, but because the build is off 
>by default we don't know it.  I would prefer that they are built and 
>tested by default.  If the build and test time starts to take too long, 
>to me that means we need to start wondering if we have too many contrib modules.
>
>-Bobby
>
>From: Brian Enochson <brian.enochson@gmail.com>
>Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>Date: Tuesday, February 25, 2014 at 9:50 PM
>To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
>hi,
>  I am in agreement with Taylor and believe I understand his intent. An 
>incredible tool/framework/application like Storm is only enhanced and 
>gains value from the number of well maintained and vetted modules that 
>can be used for integration and adding further functionality.
> I am relatively new to the Storm community but have spent quite some 
>time reviewing contributing modules out there, reviewing various 
>duplicates and running into some version incompatibilities. I 
>understand the need to keep Storm itself pure, but do think there needs 
>to be some structure and governance added to the contributing modules. 
>Look at the benefit a tool like npm brings to the node community.
> I like the idea of sponsorship, vetting and a community vote.  I, as 
>I'm sure many would be, am willing to offer support and time to working 
>through how to set this up and helping with the implementation if it is 
>decided to pursue some solution.
> I hope these views are taken in the spirit they are made, to make this 
>incredible system even better along with the surrounding eco-system.
>
>Thanks,
>Brian
>
>
>On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz 
><pt...@gmail.com> wrote:
>Just to be clear (and play a little Devil's advocate :) ), I'm not 
>suggesting that whatever a "contrib" project/module/subproject might 
>become, be a clearinghouse for anything Storm-related.
>
>I see it as something that is well-vetted by the Storm community, 
>subject to PPMC review, vote, etc. Entry would require community 
>review, PPMC review, and in some cases ASF IP clearance/legal review. 
>Anything added would require some level of commitment from the 
>PPMC/committers to provide some level of support.
>
>In other words, nothing "willy-nilly".
>
>One option could be that any module added require (X > 0)  number of 
>committers to volunteer as "sponsor"s for the module, and commit to 
>maintaining it.
>
>That being said, I don't see storm-kafka being any different from 
>anything else that provides integration points for Storm.
>
>-Taylor
>
>
>On Feb 25, 2014, at 7:53 PM, Nathan Marz 
><nathan@nathanmarz.com> wrote:
>
>I'm only +1 for pulling in storm-kafka and updating it. Other projects 
>put these contrib modules in a "contrib" folder and keep them managed 
>as completely separate codebases. As it's not actually a "module" 
>necessary for Storm, there's an argument there for doing it that way 
>rather than via the multi-module route.
>
>
>On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage 
><mpathira@umail.iu.edu> wrote:
>Hi Taylor,
>
>I'm +1 for pulling these external libraries into Apache codebase. This 
>will certainly benefit the Storm community. I would also like to 
>contribute to this process.
>
>Thanks
>Milinda
>
>On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz 
><pt...@gmail.com> wrote:
>A while back I opened STORM-206 [1] to capture ideas for pulling in 
>"contrib" modules to the Apache codebase.
>
>In the past, we had the storm-contrib github project [2] which 
>subsequently got broken up into individual projects hosted on the 
>stormprocessor github group [3] and elsewhere.
>
>The problem with this approach is that in certain cases it led to code 
>rot (modules not being updated in step with Storm's API), fragmentation 
>(multiple similar modules with the same name), and confusion.
>
>A good example of this is the storm-kafka module [4], since it is a 
>widely used component. Because storm-contrib wasn't being tagged in 
>github, a lot of users had trouble reconciling with which versions of 
>storm it was compatible. Some users built off specific commit hashes, 
>some forked, and a few even pushed custom builds to repositories such 
>as clojars. With kafka 0.8 now available, there are two main 
>storm-kafka projects, the original (compatible with kafka 0.7) and an 
>updated fork [5] 
>(compatible with kafka 0.8).
>
>My intention is not to find fault in any way, but rather to point out 
>the resulting pain, and work toward a better solution.
>
>I think it would be beneficial to the Storm user community to have 
>certain commonly used modules like storm-kafka brought into the Apache 
>Storm project. Another benefit worth considering is the licensing/legal 
>oversight that the ASF provides, which is important to many users.
>
>If this is something we want to do, then the big question becomes what 
>sort of governance process needs to be established to ensure that such 
>things are properly maintained.
>
>Some random thoughts, questions, etc. that jump to mind include:
>
>What to call these things: "contrib modules", "connectors", "integration 
>modules", etc.?
>Build integration: I imagine they would be a multi-module submodule of 
>the main maven build. Probably turned off by default and enabled by a 
>maven profile.
>Governance: Have one or more committer volunteers responsible for 
>maintenance, merging patches, etc.? Proposal process for pulling new 
>modules?
>
>
>I look forward to hearing others' opinions.
>
>- Taylor
>
>
>[1] https://issues.apache.org/jira/browse/STORM-206
>[2] https://github.com/nathanmarz/storm-contrib
>[3] https://github.com/stormprocessor
>[4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>[5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>
>
>
>--
>Milinda Pathirage
>
>PhD Student | Research Assistant
>School of Informatics and Computing | Data to Insight Center Indiana 
>University
>
>twitter: milindalakmal
>skype: milinda.pathirage
>blog: http://milinda.pathirage.org<http://milinda.pathirage.org/>
>
>
>
>--
>Twitter: @nathanmarz
>http://nathanmarz.com<http://nathanmarz.com/>
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Brian O'Neill <bo...@alumni.brown.edu>.

Bobby,

FWIW, I'd love to see storm-yarn inside.  I think we could definitely make
things easier on the end-user if they were more cohesive.

e.g. Imagine if we had "storm launch yarn" inside of $storm/bin that would
kick off a storm-yarn launch, with whatever version was built.  It would
likely simplify the "create-tarball" and storm-yarn getStormConfig process
as well.

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive  King of Prussia, PA  19406
M: 215.588.6024  @boneill42 <http://www.twitter.com/boneill42>  
healthmarketscience.com

On 2/26/14, 4:25 PM, "Bobby Evans" <ev...@yahoo-inc.com> wrote:

>I totally agree and I am +1 on bringing these spout/trident pieces in,
>assuming there are committers to support them.
>
>I am also curious about how people feel about pulling in other projects
>like storm-starter, storm-deploy, storm-mesos, and storm-yarn?
>
>Storm-starter in my opinion seems more like documentation and it would be
>nice to pull in so that it stays up to date with storm itself, just like
>the documentation.
>
>The others are more ways to run storm in different environments.  They
>seem like there could be a lot of coupling between them and storm as
>storm evolves, and they kind of fit with "integrate storm with
>*Technology X*" except X in this case is a compute environment instead of
>a data source or store. But then again we also just shot down a request
>to create juju charms for storm.
>
>Bobby
>
>From: "P. Taylor Goetz" <pt...@gmail.com>
>Reply-To: <de...@storm.incubator.apache.org>
>Date: Wednesday, February 26, 2014 at 1:21 PM
>To: <de...@storm.incubator.apache.org>
>Cc: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
>Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
>Thanks for the feedback Bobby.
>
>To clarify, I'm mainly talking about spout/bolt/trident state
>implementations that integrate storm with *Technology X*, where
>*Technology X* is not a fundamental part of storm.
>
>Examples would be technologies that are part of or related to the
>Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka,
>HDFS, HBase, Cassandra, etc.
>
>The idea behind having one or more Storm committers act as a "sponsor" is
>to make sure new additions are done carefully and with good reason. To
>add a new module, it would require committer/PPMC consensus, and
>assignment of one or more sponsors. Part of a sponsor's job would be to
>ensure that a module is maintained, which would require enough
>familiarity with the code to support it long term. If a new module was
>proposed, but no committers were willing to act as a sponsor, it would
>not be added.
>
>It would be the Committers'/PPMC's responsibility to make sure things didn't
>get out of hand, and to do something about it if they do.
>
>Here's an old Hadoop JIRA thread [1] discussing the addition of Hive as a
>contrib module, similar to what happened with HBase as Bobby pointed out.
>Some interesting points are brought up. The difference here is that both
>HBase and Hive were pretty big codebases relative to Hadoop. With
>spout/bolt/state implementations I doubt we'd see anything along that
>scale.
>
>- Taylor
>
>[1] https://issues.apache.org/jira/browse/HADOOP-3601
>
>
>On Feb 26, 2014, at 12:35 PM, Bobby Evans
><ev...@yahoo-inc.com>> wrote:
>
>I can see a lot of value in having a distribution of storm that comes
>with batteries included, everything is tested together and you know it
>works.  But I don't see much long term developer benefit in building them
>all together.  If there is strong coupling between storm and these
>external projects so that they break when storm changes then we need to
>understand the coupling and decide if we want to reduce that coupling by
>stabilizing APIs, improving version numbering and release process, etc.;
>or if the functionality is something that should be offered as a base
>service in storm.
>
>I can see politically the value of giving these other projects a home in
>Apache, and making them sub-projects is the simplest route to that.  I'd
>love to have storm on yarn inside Apache.  I just don't want to go
>overboard with it.  There was a time when HBase was a "contrib" module
>under Hadoop along with a lot of other things, and the Apache board came
>and told Hadoop to break it up.
>
>Bringing storm-kafka into storm does not sound like it will solve much
>from a developer's perspective, because there is at least as much
>coupling with kafka as there is with storm.  I can see how it is a huge
>amount of overhead and pain to set up a new project just for a few
>hundred lines of code, as such I am in favor of pulling in closely
>related projects, especially those that are spouts and state
>implementations. I just want to be sure that we do it carefully, with a
>good reason, and with enough people who are familiar with the code to
>support it long term.
>
>If it starts to look like we are pulling in too many projects perhaps we
>should look at something more like the bigtop project
>https://bigtop.apache.org/ which produces a tested distribution of Hadoop
>with many different sub-projects included in it.
>
>I am also a bit concerned about these sub-projects becoming second class
>citizens, where we break something, but because the build is off by
>default we don't know it.  I would prefer that they are built and tested
>by default.  If the build and test time starts to take too long, to me
>that means we need to start wondering if we have too many contrib modules.
>
>Bobby
>
>From: Brian Enochson <brian.enochson@gmail.com>
>Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>Date: Tuesday, February 25, 2014 at 9:50 PM
>To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
>Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
>Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
>hi,
>  I am in agreement with Taylor and believe I understand his intent. An
>incredible tool/framework/application like Storm is only enhanced and
>gains value from the number of well maintained and vetted modules that
>can be used for integration and adding further functionality.
> I am relatively new to the Storm community but have spent quite some
>time reviewing contributing modules out there, reviewing various
>duplicates and running into some version incompatibilities. I understand
>the need to keep Storm itself pure, but do think there needs to be some
>structure and governance added to the contributing modules. Look at the
>benefit a tool like npm brings to the node community.
> I like the idea of sponsorship, vetting and a community vote.  I, as I'm
>sure many would be, am willing to offer support and time to working
>through how to set this up and helping with the implementation if it is
>decided to pursue some solution.
> I hope these views are taken in the spirit they are made, to make this
>incredible system even better along with the surrounding eco-system.
>
>Thanks,
>Brian
>
>
>On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz
><pt...@gmail.com>>
>wrote:
>Just to be clear (and play a little Devil's advocate :) ), I'm not
>suggesting that whatever a "contrib" project/module/subproject might
>become, be a clearinghouse for anything Storm-related.
>
>I see it as something that is well-vetted by the Storm community, subject
>to PPMC review, vote, etc. Entry would require community review, PPMC
>review, and in some cases ASF IP clearance/legal review. Anything added
>would require some level of commitment from the PPMC/committers to
>provide some level of support.
>
>In other words, nothing "willy-nilly".
>
>One option could be that any module added require (X > 0)  number of
>committers to volunteer as "sponsor"s for the module, and commit to
>maintaining it.
>
>That being said, I don't see storm-kafka being any different from
>anything else that provides integration points for Storm.
>
>-Taylor
>
>
>On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>
>I'm only +1 for pulling in storm-kafka and updating it. Other projects
>put these contrib modules in a "contrib" folder and keep them managed as
>completely separate codebases. As it's not actually a "module" necessary
>for Storm, there's an argument there for doing it that way rather than
>via the multi-module route.
>
>
>On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
>Hi Taylor,
>
>I'm +1 for pulling these external libraries into the Apache codebase. This
>will certainly benefit the Storm community. I would also like to contribute
>to this process.
>
>Thanks
>Milinda
>
>On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz
><pt...@gmail.com>>
>wrote:
>A while back I opened STORM-206 [1] to capture ideas for pulling in
>"contrib" modules to the Apache codebase.
>
>In the past, we had the storm-contrib github project [2] which
>subsequently
>got broken up into individual projects hosted on the stormprocessor github
>group [3] and elsewhere.
>
>The problem with this approach is that in certain cases it led to code rot
>(modules not being updated in step with Storm's API), fragmentation
>(multiple similar modules with the same name), and confusion.
>
>A good example of this is the storm-kafka module [4], since it is a widely
>used component. Because storm-contrib wasn't being tagged in github, a lot
>of users had trouble reconciling with which versions of storm it was
>compatible. Some users built off specific commit hashes, some forked, and
>a
>few even pushed custom builds to repositories such as clojars. With kafka
>0.8 now available, there are two main storm-kafka projects, the original
>(compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
>0.8).
>
>My intention is not to find fault in any way, but rather to point out the
>resulting pain, and work toward a better solution.
>
>I think it would be beneficial to the Storm user community to have certain
>commonly used modules like storm-kafka brought into the Apache Storm
>project. Another benefit worth considering is the licensing/legal
>oversight
>that the ASF provides, which is important to many users.
>
>If this is something we want to do, then the big question becomes what
>sort of governance process needs to be established to ensure that such
>things are properly maintained.
>
>Some random thoughts, questions, etc. that jump to mind include:
>
>What to call these things: "contrib modules", "connectors", "integration
>modules", etc.?
>Build integration: I imagine they would be a multi-module submodule of the
>main maven build. Probably turned off by default and enabled by a maven
>profile.
>Governance: Have one or more committer volunteers responsible for
>maintenance, merging patches, etc.? Proposal process for pulling new
>modules?
>
>
>I look forward to hearing others' opinions.
>
>- Taylor
>
>
>[1] https://issues.apache.org/jira/browse/STORM-206
>[2] https://github.com/nathanmarz/storm-contrib
>[3] https://github.com/stormprocessor
>[4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>[5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>
>
>
>--
>Milinda Pathirage
>
>PhD Student | Research Assistant
>School of Informatics and Computing | Data to Insight Center
>Indiana University
>
>twitter: milindalakmal
>skype: milinda.pathirage
>blog: http://milinda.pathirage.org<http://milinda.pathirage.org/>
>
>
>
>--
>Twitter: @nathanmarz
>http://nathanmarz.com<http://nathanmarz.com/>
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "Sean Zhong(clockfly)" <cl...@gmail.com>.

IMHO, storm-yarn and storm-starter should be brought in first.

storm-starter is simple, easy to maintain, and serves as a good starting
point. storm-yarn is necessary to work with Hadoop 2.

With these, the user can immediately have a workable storm cluster on YARN,
so they are more basic.

Storm connectors for Cassandra, HBase, etc. are also very important,
but bringing them in also means complex version dependencies on other
products. If there is an upgrade upstream, we need to update storm
as well, and I think that will slow the evolution of storm itself. It also
means more effort for developers, because now we need to make sure that
every check-in does not break the function of other modules, and if unit
tests fail, we need to fix them.

Besides this, which upstream product version do we pick? If the built-in
version is incompatible with my production environment, what can I do?
Should I remove the dependency from storm manually? It is not only about
whether we can find the right people to maintain them, it is also about
whether storm can be adopted by multiple and diverse production
environments.

To make it easier to find the right connector for the current storm version,
would clear documentation work instead of bringing them in? With
documentation, we only need to check and sync at release time, instead of
maintaining daily compatibility.


 Sean

On Sat, Mar 1, 2014 at 4:40 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I am also happy to help maintain HDFS, HBase, and JMS related modules.
>  Perhaps it is best to pull in a few of these modules and see how things
> go, before we continue the discussion about other more complicated pieces.
>
> --Bobby
>
> From: "P. Taylor Goetz" <pt...@gmail.com>
> Reply-To: <user@storm.incubator.apache.org>
> Date: Wednesday, February 26, 2014 at 4:22 PM
> To: <dev@storm.incubator.apache.org>
> Cc: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
> I purposely left out storm-starter from the discussion to keep things
> focused, and because it's a different animal. But I also feel it should be
> pulled in, albeit differently. I was thinking something along the lines of
> an "examples" directory, and that all committers would share collective
> ownership/responsibility.
>
> I haven't thought too much yet about the others (storm-yarn, etc.), but I
> think that warrants a discussion as well.
>
> Personally, I'd be willing to sponsor modules for Cassandra, HDFS, HBase,
> and JMS.
>
> I also contacted the author of storm-kafka-0.8-plus, and he is willing to
> contribute that work and help with maintenance.
>
> Regarding the juju charms issue [1], my intent wasn't to shoot it down
> entirely (which is why I left it open), but rather make it clear that it's
> not a priority at this point in time. I'll admit that it was a bit of a
> knee-jerk reaction to the fact that someone from Canonical essentially
> spammed a bunch of Apache projects with the same request. It also seemed
> not unlike a request for us to maintain .rpm and .deb packages,  etc.,
> which is a path I'd be very hesitant to go down.
>
> - Taylor
>
> [1] https://issues.apache.org/jira/browse/STORM-240
>
> On Feb 26, 2014, at 4:25 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>
> I totally agree and I am +1 on bringing these spout/trident pieces in,
> assuming there are committers to support them.
>
> I am also curious about how people feel about pulling in other projects
> like storm-starter, storm-deploy, storm-mesos, and storm-yarn?
>
> Storm-starter in my opinion seems more like documentation and it would be
> nice to pull in so that it stays up to date with storm itself, just like
> the documentation.
>
> The others are more ways to run storm in different environments.  They
> seem like there could be a lot of coupling between them and storm as storm
> evolves, and they kind of fit with "integrate storm with *Technology X*"
> except X in this case is a compute environment instead of a data source or
> store. But then again we also just shot down a request to create juju
> charms for storm.
>
> --Bobby
>
> From: "P. Taylor Goetz" <ptgoetz@gmail.com>
> Reply-To: <dev@storm.incubator.apache.org>
> Date: Wednesday, February 26, 2014 at 1:21 PM
> To: <dev@storm.incubator.apache.org>
> Cc: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
> Thanks for the feedback Bobby.
>
> To clarify, I'm mainly talking about spout/bolt/trident state
> implementations that integrate storm with *Technology X*, where *Technology
> X* is not a fundamental part of storm.
>
> Examples would be technologies that are part of or related to the
> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka,
> HDFS, HBase, Cassandra, etc.
>
> The idea behind having one or more Storm committers act as a "sponsor" is
> to make sure new additions are done carefully and with good reason. To add
> a new module, it would require committer/PPMC consensus, and assignment of
> one or more sponsors. Part of a sponsor's job would be to ensure that a
> module is maintained, which would require enough familiarity with the code
> to support it long term. If a new module was proposed, but no committers
> were willing to act as a sponsor, it would not be added.
>
> It would be the Committers'/PPMC's responsibility to make sure things didn't
> get out of hand, and to do something about it if they do.
>
> Here's an old Hadoop JIRA thread [1] discussing the addition of Hive as a
> contrib module, similar to what happened with HBase as Bobby pointed out.
> Some interesting points are brought up. The difference here is that both
> HBase and Hive were pretty big codebases relative to Hadoop. With
> spout/bolt/state implementations I doubt we'd see anything along that scale.
>
> - Taylor
>
> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>
>
> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>
> I can see a lot of value in having a distribution of storm that comes with
> batteries included, everything is tested together and you know it works.
>  But I don't see much long term developer benefit in building them all
> together.  If there is strong coupling between storm and these external
> projects so that they break when storm changes then we need to understand
> the coupling and decide if we want to reduce that coupling by stabilizing
> APIs, improving version numbering and release process, etc.; or if the
> functionality is something that should be offered as a base service in
> storm.
>
> I can see politically the value of giving these other projects a home in
> Apache, and making them sub-projects is the simplest route to that.  I'd
> love to have storm on yarn inside Apache.  I just don't want to go
> overboard with it.  There was a time when HBase was a "contrib" module
> under Hadoop along with a lot of other things, and the Apache board came
> and told Hadoop to break it up.
>
> Bringing storm-kafka into storm does not sound like it will solve much
> from a developer's perspective, because there is at least as much coupling
> with kafka as there is with storm.  I can see how it is a huge amount of
> overhead and pain to set up a new project just for a few hundred lines of
> code, as such I am in favor of pulling in closely related projects,
> especially those that are spouts and state implementations. I just want to
> be sure that we do it carefully, with a good reason, and with enough people
> who are familiar with the code to support it long term.
>
> If it starts to look like we are pulling in too many projects perhaps we
> should look at something more like the bigtop project
> https://bigtop.apache.org/ which produces a tested distribution of Hadoop
> with many different sub-projects included in it.
>
> I am also a bit concerned about these sub-projects becoming second class
> citizens, where we break something, but because the build is off by default
> we don't know it.  I would prefer that they are built and tested by
> default.  If the build and test time starts to take too long, to me that
> means we need to start wondering if we have too many contrib modules.
>
> --Bobby
>
> From: Brian Enochson <brian.enochson@gmail.com>
> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Date: Tuesday, February 25, 2014 at 9:50 PM
> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Cc: "dev@storm.incubator.apache.org" <dev@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
> hi,
>  I am in agreement with Taylor and believe I understand his intent. An
> incredible tool/framework/application like Storm is only enhanced and gains
> value from the number of well maintained and vetted modules that can be
> used for integration and adding further functionality.
> I am relatively new to the Storm community but have spent quite some time
> reviewing contributing modules out there, reviewing various duplicates and
> running into some version incompatibilities. I understand the need to keep
> Storm itself pure, but do think there needs to be some structure and
> governance added to the contributing modules. Look at the benefit a tool
> like npm brings to the node community.
> I like the idea of sponsorship, vetting and a community vote.  I, as I'm sure
> many would be, am willing to offer support and time to working through how
> to set this up and helping with the implementation if it is decided to
> pursue some solution.
> I hope these views are taken in the spirit they are made, to make this
> incredible system even better along with the surrounding eco-system.
>
> Thanks,
> Brian
>
>
> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> Just to be clear (and play a little Devil's advocate :) ), I'm not
> suggesting that whatever a "contrib" project/module/subproject might
>  become, be a clearinghouse for anything Storm-related.
>
> I see it as something that is well-vetted by the Storm community, subject
> to PPMC review, vote, etc. Entry would require community review, PPMC
> review, and in some cases ASF IP clearance/legal review. Anything added
> would require some level of commitment from the PPMC/committers to provide
> some level of support.
>
> In other words, nothing "willy-nilly".
>
> One option could be that any module added require (X > 0)  number of
> committers to volunteer as "sponsor"s for the module, and commit to
> maintaining it.
>
> That being said, I don't see storm-kafka being any different from anything
> else that provides integration points for Storm.
>
> -Taylor
>
>
> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>
> I'm only +1 for pulling in storm-kafka and updating it. Other projects put
> these contrib modules in a "contrib" folder and keep them managed as
> completely separate codebases. As it's not actually a "module" necessary
> for Storm, there's an argument there for doing it that way rather than via
> the multi-module route.
>
>
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
> Hi Taylor,
>
> I'm +1 for pulling these external libraries into the Apache codebase. This
> will certainly benefit the Storm community. I would also like to contribute
> to this process.
>
> Thanks
> Milinda
>
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> A while back I opened STORM-206 [1] to capture ideas for pulling in
> "contrib" modules to the Apache codebase.
>
> In the past, we had the storm-contrib github project [2] which subsequently
> got broken up into individual projects hosted on the stormprocessor github
> group [3] and elsewhere.
>
> The problem with this approach is that in certain cases it led to code rot
> (modules not being updated in step with Storm's API), fragmentation
> (multiple similar modules with the same name), and confusion.
>
> A good example of this is the storm-kafka module [4], since it is a widely
> used component. Because storm-contrib wasn't being tagged in github, a lot
> of users had trouble reconciling with which versions of storm it was
> compatible. Some users built off specific commit hashes, some forked, and a
> few even pushed custom builds to repositories such as clojars. With kafka
> 0.8 now available, there are two main storm-kafka projects, the original
> (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> 0.8).
>
> My intention is not to find fault in any way, but rather to point out the
> resulting pain, and work toward a better solution.
>
> I think it would be beneficial to the Storm user community to have certain
> commonly used modules like storm-kafka brought into the Apache Storm
> project. Another benefit worth considering is the licensing/legal oversight
> that the ASF provides, which is important to many users.
>
> If this is something we want to do, then the big question becomes what sort
> of governance process needs to be established to ensure that such things are
> properly maintained.
>
> Some random thoughts, questions, etc. that jump to mind include:
>
> What to call these things: "contrib modules", "connectors", "integration
> modules", etc.?
> Build integration: I imagine they would be a multi-module submodule of the
> main maven build. Probably turned off by default and enabled by a maven
> profile.
> Governance: Have one or more committer volunteers responsible for
> maintenance, merging patches, etc.? Proposal process for pulling new
> modules?
>
>
> I look forward to hearing others' opinions.
>
> - Taylor
>
>
> [1] https://issues.apache.org/jira/browse/STORM-206
> [2] https://github.com/nathanmarz/storm-contrib
> [3] https://github.com/stormprocessor
> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org<http://milinda.pathirage.org/>
>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com<http://nathanmarz.com/>
>
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "Sean Zhong(clockfly)" <cl...@gmail.com>.

IMHO, storm-yarn and storm-starter should be brought in first.

storm-starter is simple, easy to maintain, and serves as a good starting
point. storm-yarn is necessary to work with Hadoop 2.

With these, the user can immediately have a workable storm cluster on YARN,
so they are more basic.

Storm connectors for Cassandra, HBase, etc. are also very important,
but bringing them in also means complex version dependencies on other
products. If there is an upgrade upstream, we need to update storm
as well, and I think that will slow the evolution of storm itself. It also
means more effort for developers, because now we need to make sure that
every check-in does not break the function of other modules, and if unit
tests fail, we need to fix them.

Besides this, which upstream product version do we pick? If the built-in
version is incompatible with my production environment, what can I do?
Should I remove the dependency from storm manually? It is not only about
whether we can find the right people to maintain them, it is also about
whether storm can be adopted by multiple and diverse production
environments.

To make it easier to find the right connector for the current storm version,
would clear documentation work instead of bringing them in? With
documentation, we only need to check and sync at release time, instead of
maintaining daily compatibility.


 Sean

On Sat, Mar 1, 2014 at 4:40 AM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I am also happy to help maintain HDFS, HBase, and JMS related modules.
>  Perhaps it is best to pull in a few of these modules and see how things
> go, before we continue the discussion about other more complicated pieces.
>
> --Bobby
>
> From: "P. Taylor Goetz" <pt...@gmail.com>
> Reply-To: <user@storm.incubator.apache.org>
> Date: Wednesday, February 26, 2014 at 4:22 PM
> To: <dev@storm.incubator.apache.org>
> Cc: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
> I purposely left out storm-starter from the discussion to keep things
> focused, and because it's a different animal. But I also feel it should be
> pulled in, albeit differently. I was thinking something along the lines of
> an "examples" directory, and that all committers would share collective
> ownership/responsibility.
>
> I haven't thought too much yet about the others (storm-yarn, etc.), but I
> think that warrants a discussion as well.
>
> Personally, I'd be willing to sponsor modules for Cassandra, HDFS, HBase,
> and JMS.
>
> I also contacted the author of storm-kafka-0.8-plus, and he is willing to
> contribute that work and help with maintenance.
>
> Regarding the juju charms issue [1], my intent wasn't to shoot it down
> entirely (which is why I left it open), but rather make it clear that it's
> not a priority at this point in time. I'll admit that it was a bit of a
> knee-jerk reaction to the fact that someone from Canonical essentially
> spammed a bunch of Apache projects with the same request. It also seemed
> not unlike a request for us to maintain .rpm and .deb packages,  etc.,
> which is a path I'd be very hesitant to go down.
>
> - Taylor
>
> [1] https://issues.apache.org/jira/browse/STORM-240
>
> On Feb 26, 2014, at 4:25 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>
> I totally agree and I am +1 on bringing these spout/trident pieces in,
> assuming there are committers to support them.
>
> I am also curious about how people feel about pulling in other projects
> like storm-starter, storm-deploy, storm-mesos, and storm-yarn?
>
> Storm-starter in my opinion seems more like documentation and it would be
> nice to pull in so that it stays up to date with storm itself, just like
> the documentation.
>
> The others are more of ways to run storm in different environments.  They
> seem like there could be a lot of coupling between them and storm as storm
> evolves, and they kind of fit with "integrate storm with *Technology X*"
> except X in this case is a compute environment instead of a data source or
> store. But then again we also just shot down a request to create juju
> charms for storm.
>
> --Bobby
>
> From: "P. Taylor Goetz" <ptgoetz@gmail.com>
> Reply-To: <dev@storm.incubator.apache.org>
> Date: Wednesday, February 26, 2014 at 1:21 PM
> To: <dev@storm.incubator.apache.org>
> Cc: <user@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
> Thanks for the feedback Bobby.
>
> To clarify, I'm mainly talking about spout/bolt/trident state
> implementations that integrate storm with *Technology X*, where *Technology
> X* is not a fundamental part of storm.
>
> Examples would be technologies that are part of or related to the
> Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka,
> HDFS, HBase, Cassandra, etc.
>
> The idea behind having one or more Storm committers act as a "sponsor" is
> to make sure new additions are done carefully and with good reason. To add
> a new module, it would require committer/PPMC consensus, and assignment of
> one or more sponsors. Part of a sponsor's job would be to ensure that a
> module is maintained, which would require enough familiarity with the code
> to support it long term. If a new module was proposed, but no committers
> were willing to act as a sponsor, it would not be added.
>
> It would be the Committers'/PPMC's responsibility to make sure things didn't
> get out of hand, and to do something about it if they did.
>
> Here's an old Hadoop JIRA thread [1] discussing the addition of Hive as a
> contrib module, similar to what happened with HBase as Bobby pointed out.
> Some interesting points are brought up. The difference here is that both
> HBase and Hive were pretty big codebases relative to Hadoop. With
> spout/bolt/state implementations I doubt we'd see anything along that scale.
>
> - Taylor
>
> [1] https://issues.apache.org/jira/browse/HADOOP-3601
>
>
> On Feb 26, 2014, at 12:35 PM, Bobby Evans <evans@yahoo-inc.com> wrote:
>
> I can see a lot of value in having a distribution of storm that comes with
> batteries included, everything is tested together and you know it works.
>  But I don't see much long term developer benefit in building them all
> together.  If there is strong coupling between storm and these external
> projects so that they break when storm changes then we need to understand
> the coupling and decide if we want to reduce that coupling by stabilizing
> APIs, improving version numbering and release process, etc.; or if the
> functionality is something that should be offered as a base service in
> storm.
>
> I can see politically the value of giving these other projects a home in
> Apache, and making them sub-projects is the simplest route to that.  I'd
> love to have storm on yarn inside Apache.  I just don't want to go
> overboard with it.  There was a time when HBase was a "contrib" module
> under Hadoop along with a lot of other things, and the Apache board came
> and told Hadoop to break it up.
>
> Bringing storm-kafka into storm does not sound like it will solve much
> from a developer's perspective, because there is at least as much coupling
> with kafka as there is with storm.  I can see how it is a huge amount of
> overhead and pain to set up a new project just for a few hundred lines of
> code, as such I am in favor of pulling in closely related projects,
> especially those that are spouts and state implementations. I just want to
> be sure that we do it carefully, with a good reason, and with enough people
> who are familiar with the code to support it long term.
>
> If it starts to look like we are pulling in too many projects perhaps we
> should look at something more like the bigtop project
> https://bigtop.apache.org/ which produces a tested distribution of Hadoop
> with many different sub-projects included in it.
>
> I am also a bit concerned about these sub-projects becoming second class
> citizens, where we break something, but because the build is off by default
> we don't know it.  I would prefer that they are built and tested by
> default.  If the build and test time starts to take too long, to me that
> means we need to start wondering if we have too many contrib modules.
>
> --Bobby
>
> From: Brian Enochson <brian.enochson@gmail.com>
> Reply-To: <user@storm.incubator.apache.org>
> Date: Tuesday, February 25, 2014 at 9:50 PM
> To: <user@storm.incubator.apache.org>
> Cc: <dev@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
> hi,
>  I am in agreement with Taylor and believe I understand his intent. An
> incredible tool/framework/application like Storm is only enhanced and gains
> value from the number of well maintained and vetted modules that can be
> used for integration and adding further functionality.
> I am relatively new to the Storm community but have spent quite some time
> reviewing contributing modules out there, reviewing various duplicates and
> running into some version incompatibilities. I understand the need to keep
> Storm itself pure, but do think there needs to be some structure and
> governance added to the contributing modules. Look at the benefit a tool
> like npm brings to the node community.
> I like the idea of sponsorship, vetting and a community vote.  I, as I am
> sure many would be, am willing to offer support and time to working through
> how to set this up and helping with the implementation if it is decided to
> pursue some solution.
> I hope these views are taken in the spirit in which they are made, to make
> this incredible system even better along with the surrounding eco-system.
>
> Thanks,
> Brian
>
>
> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> Just to be clear (and play a little Devil's advocate :) ), I'm not
> suggesting that whatever a "contrib" project/module/subproject might
>  become, be a clearinghouse for anything Storm-related.
>
> I see it as something that is well-vetted by the Storm community, subject
> to PPMC review, vote, etc. Entry would require community review, PPMC
> review, and in some cases ASF IP clearance/legal review. Anything added
> would require some level of commitment from the PPMC/committers to provide
> some level of support.
>
> In other words, nothing "willy-nilly".
>
> One option could be that any module added require (X > 0)  number of
> committers to volunteer as "sponsor"s for the module, and commit to
> maintaining it.
>
> That being said, I don't see storm-kafka being any different from anything
> else that provides integration points for Storm.
>
> -Taylor
>
>
> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>
> I'm only +1 for pulling in storm-kafka and updating it. Other projects put
> these contrib modules in a "contrib" folder and keep them managed as
> completely separate codebases. As it's not actually a "module" necessary
> for Storm, there's an argument there for doing it that way rather than via
> the multi-module route.
>
>
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
> Hi Taylor,
>
> I'm +1 for pulling these external libraries into the Apache codebase. This
> will certainly benefit the Storm community. I would also like to contribute
> to this process.
>
> Thanks
> Milinda
>
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> A while back I opened STORM-206 [1] to capture ideas for pulling in
> "contrib" modules to the Apache codebase.
>
> In the past, we had the storm-contrib github project [2] which subsequently
> got broken up into individual projects hosted on the stormprocessor github
> group [3] and elsewhere.
>
> The problem with this approach is that in certain cases it led to code rot
> (modules not being updated in step with Storm's API), fragmentation
> (multiple similar modules with the same name), and confusion.
>
> A good example of this is the storm-kafka module [4], since it is a widely
> used component. Because storm-contrib wasn't being tagged in github, a lot
> of users had trouble reconciling with which versions of storm it was
> compatible. Some users built off specific commit hashes, some forked, and a
> few even pushed custom builds to repositories such as clojars. With kafka
> 0.8 now available, there are two main storm-kafka projects, the original
> (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> 0.8).
>
> My intention is not to find fault in any way, but rather to point out the
> resulting pain, and work toward a better solution.
>
> I think it would be beneficial to the Storm user community to have certain
> commonly used modules like storm-kafka brought into the Apache Storm
> project. Another benefit worth considering is the licensing/legal oversight
> that the ASF provides, which is important to many users.
>
> If this is something we want to do, then the big question becomes what sort
> of governance process needs to be established to ensure that such things are
> properly maintained.
>
> Some random thoughts, questions, etc. that jump to mind include:
>
> What to call these things: "contrib modules", "connectors", "integration
> modules", etc.?
> Build integration: I imagine they would be a multi-module submodule of the
> main maven build. Probably turned off by default and enabled by a maven
> profile.
> Governance: Have one or more committer volunteers responsible for
> maintenance, merging patches, etc.? Proposal process for pulling new
> modules?
>
>
> I look forward to hearing others' opinions.
>
> - Taylor
>
>
> [1] https://issues.apache.org/jira/browse/STORM-206
> [2] https://github.com/nathanmarz/storm-contrib
> [3] https://github.com/stormprocessor
> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org/
>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com/
>
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Bobby Evans <ev...@yahoo-inc.com>.

I am also happy to help maintain HDFS, HBase, and JMS related modules.  Perhaps it is best to pull in a few of these modules and see how things go, before we continue the discussion about other more complicated pieces.

—Bobby

From: "P. Taylor Goetz" <pt...@gmail.com>
Reply-To: <us...@storm.incubator.apache.org>
Date: Wednesday, February 26, 2014 at 4:22 PM
To: <de...@storm.incubator.apache.org>
Cc: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

I purposely left out storm-starter from the discussion to keep things focused, and because it’s a different animal. But I also feel it should be pulled in, albeit differently. I was thinking something along the lines of an “examples” directory, and that all committers would share collective ownership/responsibility.

I haven’t thought too much yet about the others (storm-yarn, etc.), but I think that warrants a discussion as well.

Personally, I’d be willing to sponsor modules for Cassandra, HDFS, HBase, and JMS.

I also contacted the author of storm-kafka-0.8-plus, and he is willing to contribute that work and help with maintenance.

Regarding the juju charms issue [1], my intent wasn’t to shoot it down entirely (which is why I left it open), but rather make it clear that it’s not a priority at this point in time. I’ll admit that it was a bit of a knee-jerk reaction to the fact that someone from Canonical essentially spammed a bunch of Apache projects with the same request. It also seemed not unlike a request for us to maintain .rpm and .deb packages,  etc., which is a path I’d be very hesitant to go down.

- Taylor

[1] https://issues.apache.org/jira/browse/STORM-240

On Feb 26, 2014, at 4:25 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

I totally agree and I am +1 on bringing these spout/trident pieces in, assuming there are committers to support them.

I am also curious about how people feel about pulling in other projects like storm-starter, storm-deploy, storm-mesos, and storm-yarn?

Storm-starter in my opinion seems more like documentation and it would be nice to pull in so that it stays up to date with storm itself, just like the documentation.

The others are more of ways to run storm in different environments.  They seem like there could be a lot of coupling between them and storm as storm evolves, and they kind of fit with “integrate storm with *Technology X*” except X in this case is a compute environment instead of a data source or store. But then again we also just shot down a request to create juju charms for storm.

—Bobby

From: "P. Taylor Goetz" <pt...@gmail.com>
Reply-To: <de...@storm.incubator.apache.org>
Date: Wednesday, February 26, 2014 at 1:21 PM
To: <de...@storm.incubator.apache.org>
Cc: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Thanks for the feedback Bobby.

To clarify, I’m mainly talking about spout/bolt/trident state implementations that integrate storm with *Technology X*, where *Technology X* is not a fundamental part of storm.

Examples would be technologies that are part of or related to the Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka, HDFS, HBase, Cassandra, etc.

The idea behind having one or more Storm committers act as a “sponsor” is to make sure new additions are done carefully and with good reason. To add a new module, it would require committer/PPMC consensus, and assignment of one or more sponsors. Part of a sponsor’s job would be to ensure that a module is maintained, which would require enough familiarity with the code to support it long term. If a new module was proposed, but no committers were willing to act as a sponsor, it would not be added.

It would be the Committers’/PPMC’s responsibility to make sure things didn’t get out of hand, and to do something about it if they did.

Here’s an old Hadoop JIRA thread [1] discussing the addition of Hive as a contrib module, similar to what happened with HBase as Bobby pointed out. Some interesting points are brought up. The difference here is that both HBase and Hive were pretty big codebases relative to Hadoop. With spout/bolt/state implementations I doubt we’d see anything along that scale.

- Taylor

[1] https://issues.apache.org/jira/browse/HADOOP-3601

On Feb 26, 2014, at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

I can see a lot of value in having a distribution of storm that comes with batteries included, everything is tested together and you know it works.  But I don’t see much long term developer benefit in building them all together.  If there is strong coupling between storm and these external projects so that they break when storm changes then we need to understand the coupling and decide if we want to reduce that coupling by stabilizing APIs, improving version numbering and release process, etc.; or if the functionality is something that should be offered as a base service in storm.

I can see politically the value of giving these other projects a home in Apache, and making them sub-projects is the simplest route to that.  I’d love to have storm on yarn inside Apache.  I just don’t want to go overboard with it.  There was a time when HBase was a “contrib” module under Hadoop along with a lot of other things, and the Apache board came and told Hadoop to break it up.

Bringing storm-kafka into storm does not sound like it will solve much from a developer’s perspective, because there is at least as much coupling with kafka as there is with storm.  I can see how it is a huge amount of overhead and pain to set up a new project just for a few hundred lines of code, as such I am in favor of pulling in closely related projects, especially those that are spouts and state implementations. I just want to be sure that we do it carefully, with a good reason, and with enough people who are familiar with the code to support it long term.

If it starts to look like we are pulling in too many projects perhaps we should look at something more like the bigtop project  https://bigtop.apache.org/ which produces a tested distribution of Hadoop with many different sub-projects included in it.

I am also a bit concerned about these sub-projects becoming second class citizens, where we break something, but because the build is off by default we don’t know it.  I would prefer that they are built and tested by default.  If the build and test time starts to take too long, to me that means we need to start wondering if we have too many contrib modules.

—Bobby

From: Brian Enochson <br...@gmail.com>
Reply-To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
Date: Tuesday, February 25, 2014 at 9:50 PM
To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
Cc: "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

hi,
 I am in agreement with Taylor and believe I understand his intent. An incredible tool/framework/application like Storm is only enhanced and gains value from the number of well maintained and vetted modules that can be used for integration and adding further functionality.
I am relatively new to the Storm community but have spent quite some time reviewing contributing modules out there, reviewing various duplicates and running into some version incompatibilities. I understand the need to keep Storm itself pure, but do think there needs to be some structure and governance added to the contributing modules. Look at the benefit a tool like npm brings to the node community.
I like the idea of sponsorship, vetting and a community vote.  I, as I am sure many would be, am willing to offer support and time to working through how to set this up and helping with the implementation if it is decided to pursue some solution.
I hope these views are taken in the spirit in which they are made, to make this incredible system even better along with the surrounding eco-system.

Thanks,
Brian

On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
Just to be clear (and play a little Devil’s advocate :) ), I’m not suggesting that whatever a “contrib” project/module/subproject might  become, be a clearinghouse for anything Storm-related.

I see it as something that is well-vetted by the Storm community, subject to PPMC review, vote, etc. Entry would require community review, PPMC review, and in some cases ASF IP clearance/legal review. Anything added would require some level of commitment from the PPMC/committers to provide some level of support.

In other words, nothing “willy-nilly”.

One option could be that any module added require (X > 0)  number of committers to volunteer as “sponsor”s for the module, and commit to maintaining it.

That being said, I don’t see storm-kafka being any different from anything else that provides integration points for Storm.

-Taylor

On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com> wrote:

I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the multi-module route.

On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu> wrote:
Hi Taylor,

I'm +1 for pulling these external libraries into the Apache codebase. This
will certainly benefit the Storm community. I would also like to contribute
to this process.

Thanks
Milinda

On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
A while back I opened STORM-206 [1] to capture ideas for pulling in
"contrib" modules to the Apache codebase.

In the past, we had the storm-contrib github project [2] which subsequently
got broken up into individual projects hosted on the stormprocessor github
group [3] and elsewhere.

The problem with this approach is that in certain cases it led to code rot
(modules not being updated in step with Storm's API), fragmentation
(multiple similar modules with the same name), and confusion.

A good example of this is the storm-kafka module [4], since it is a widely
used component. Because storm-contrib wasn't being tagged in github, a lot
of users had trouble reconciling with which versions of storm it was
compatible. Some users built off specific commit hashes, some forked, and a
few even pushed custom builds to repositories such as clojars. With kafka
0.8 now available, there are two main storm-kafka projects, the original
(compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
0.8).

My intention is not to find fault in any way, but rather to point out the
resulting pain, and work toward a better solution.

I think it would be beneficial to the Storm user community to have certain
commonly used modules like storm-kafka brought into the Apache Storm
project. Another benefit worth considering is the licensing/legal oversight
that the ASF provides, which is important to many users.

If this is something we want to do, then the big question becomes what sort
of governance process needs to be established to ensure that such things are
properly maintained.

Some random thoughts, questions, etc. that jump to mind include:

What to call these things: "contrib modules", "connectors", "integration
modules", etc.?
Build integration: I imagine they would be a multi-module submodule of the
main maven build. Probably turned off by default and enabled by a maven
profile.
Governance: Have one or more committer volunteers responsible for
maintenance, merging patches, etc.? Proposal process for pulling new
modules?

I look forward to hearing others' opinions.

- Taylor

[1] https://issues.apache.org/jira/browse/STORM-206
[2] https://github.com/nathanmarz/storm-contrib
[3] https://github.com/stormprocessor
[4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
[5] https://github.com/wurstmeister/storm-kafka-0.8-plus

--
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org/

--
Twitter: @nathanmarz
http://nathanmarz.com/

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

I purposely left out storm-starter from the discussion to keep things focused, and because it’s a different animal. But I also feel it should be pulled in, albeit differently. I was thinking something along the lines of an “examples” directory, and that all committers would share collective ownership/responsibility.

I haven’t thought too much yet about the others (storm-yarn, etc.), but I think that warrants a discussion as well.

Personally, I’d be willing to sponsor modules for Cassandra, HDFS, HBase, and JMS.

I also contacted the author of storm-kafka-0.8-plus, and he is willing to contribute that work and help with maintenance.

Regarding the juju charms issue [1], my intent wasn’t to shoot it down entirely (which is why I left it open), but rather make it clear that it’s not a priority at this point in time. I’ll admit that it was a bit of a knee-jerk reaction to the fact that someone from Canonical essentially spammed a bunch of Apache projects with the same request. It also seemed not unlike a request for us to maintain .rpm and .deb packages,  etc., which is a path I’d be very hesitant to go down.

- Taylor

[1] https://issues.apache.org/jira/browse/STORM-240

On Feb 26, 2014, at 4:25 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I totally agree and I am +1 on bringing these spout/trident pieces in, assuming there are committers to support them.
> 
> I am also curious about how people feel about pulling in other projects like storm-starter, storm-deploy, storm-mesos, and storm-yarn?
> 
> Storm-starter in my opinion seems more like documentation and it would be nice to pull in so that it stays up to date with storm itself, just like the documentation.
> 
> The others are more of ways to run storm in different environments.  They seem like there could be a lot of coupling between them and storm as storm evolves, and they kind of fit with “integrate storm with *Technology X*” except X in this case is a compute environment instead of a data source or store. But then again we also just shot down a request to create juju charms for storm.
> 
> —Bobby
> 
> From: "P. Taylor Goetz" <pt...@gmail.com>
> Reply-To: <de...@storm.incubator.apache.org>
> Date: Wednesday, February 26, 2014 at 1:21 PM
> To: <de...@storm.incubator.apache.org>
> Cc: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> 
> Thanks for the feedback Bobby.
> 
> To clarify, I’m mainly talking about spout/bolt/trident state implementations that integrate storm with *Technology X*, where *Technology X* is not a fundamental part of storm.
> 
> Examples would be technologies that are part of or related to the Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka, HDFS, HBase, Cassandra, etc.
> 
> The idea behind having one or more Storm committers act as a “sponsor” is to make sure new additions are done carefully and with good reason. To add a new module, it would require committer/PPMC consensus, and assignment of one or more sponsors. Part of a sponsor’s job would be to ensure that a module is maintained, which would require enough familiarity with the code to support it long term. If a new module was proposed, but no committers were willing to act as a sponsor, it would not be added.
> 
> It would be the Committers’/PPMC’s responsibility to make sure things didn’t get out of hand, and to do something about it if they did.
> 
> Here’s an old Hadoop JIRA thread [1] discussing the addition of Hive as a contrib module, similar to what happened with HBase as Bobby pointed out. Some interesting points are brought up. The difference here is that both HBase and Hive were pretty big codebases relative to Hadoop. With spout/bolt/state implementations I doubt we’d see anything along that scale.
> 
> - Taylor
> 
> [1] https://issues.apache.org/jira/browse/HADOOP-3601
> 
> 
> On Feb 26, 2014, at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:
> 
> I can see a lot of value in having a distribution of storm that comes with batteries included, everything is tested together and you know it works.  But I don’t see much long term developer benefit in building them all together.  If there is strong coupling between storm and these external projects so that they break when storm changes then we need to understand the coupling and decide if we want to reduce that coupling by stabilizing APIs, improving version numbering and release process, etc.; or if the functionality is something that should be offered as a base service in storm.
> 
> I can see politically the value of giving these other projects a home in Apache, and making them sub-projects is the simplest route to that.  I’d love to have storm on yarn inside Apache.  I just don’t want to go overboard with it.  There was a time when HBase was a “contrib” module under Hadoop along with a lot of other things, and the Apache board came and told Hadoop to break it up.
> 
> Bringing storm-kafka into storm does not sound like it will solve much from a developer’s perspective, because there is at least as much coupling with kafka as there is with storm.  I can see how it is a huge amount of overhead and pain to set up a new project just for a few hundred lines of code, as such I am in favor of pulling in closely related projects, especially those that are spouts and state implementations. I just want to be sure that we do it carefully, with a good reason, and with enough people who are familiar with the code to support it long term.
> 
> If it starts to look like we are pulling in too many projects perhaps we should look at something more like the bigtop project  https://bigtop.apache.org/ which produces a tested distribution of Hadoop with many different sub-projects included in it.
> 
> I am also a bit concerned about these sub-projects becoming second class citizens, where we break something, but because the build is off by default we don’t know it.  I would prefer that they are built and tested by default.  If the build and test time starts to take too long, to me that means we need to start wondering if we have too many contrib modules.
> 
> —Bobby
> 
> From: Brian Enochson <br...@gmail.com>>
> Reply-To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
> Date: Tuesday, February 25, 2014 at 9:50 PM
> To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
> Cc: "dev@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <de...@storm.incubator.apache.org>>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> 
> hi,
>  I am in agreement with Taylor and believe I understand his intent. An incredible tool/framework/application like Storm is only enhanced and gains value from the number of well maintained and vetted modules that can be used for integration and adding further functionality.
> I am relatively new to the Storm community but have spent quite some time reviewing contributing modules out there, reviewing various duplicates and running into some version incompatibilities. I understand the need to keep Storm itself pure, but do think there needs to be some structure and governance added to the contributing modules. Look at the benefit a tool like npm brings to the node community.
> I like the idea of sponsorship, vetting and a community vote.  I, as sure many would be, am willing to offer support and time to working through how to set this up and helping with the implementation if it is decided to pursue some solution.
> I hope these views are taken in the sprit they are made, to make this incredible system even better along with the surrounding eco-system.
> 
> Thanks,
> Brian
> 
> 
> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <pt...@gmail.com>> wrote:
> Just to be clear (and play a little Devil’s advocate :) ), I’m not suggesting that whatever a “contrib” project/module/subproject might  become, be a clearinghouse for anything Storm-related.
> 
> I see it as something that is well-vetted by the Storm community, subject to PPMC review, vote, etc. Entry would require community review, PPMC review, and in some cases ASF IP clearance/legal review. Anything added would require some level of commitment from the PPMC/committers to provide some level of support.
> 
> In other words, nothing “willy-nilly”.
> 
> One option could be that any module added require (X > 0)  number of committers to volunteer as “sponsor”s for the module, and commit to maintaining it.
> 
> That being said, I don’t see storm-kafka being any different from anything else that provides integration points for Storm.
> 
> -Taylor
> 
> 
> On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com>> wrote:
> 
> I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the multi-module route.
> 
> 
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu>> wrote:
> Hi Taylor,
> 
> I'm +1 for pulling these external libraries into Apache codebase. This
> will certainly benifit Strom community. I also like to contribute to
> this process.
> 
> Thanks
> Milinda
> 
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com>> wrote:
> A while back I opened STORM-206 [1] to capture ideas for pulling in
> "contrib" modules to the Apache codebase.
> 
> In the past, we had the storm-contrib github project [2] which subsequently
> got broken up into individual projects hosted on the stormprocessor github
> group [3] and elsewhere.
> 
> The problem with this approach is that in certain cases it led to code rot
> (modules not being updated in step with Storm's API), fragmentation
> (multiple similar modules with the same name), and confusion.
> 
> A good example of this is the storm-kafka module [4], since it is a widely
> used component. Because storm-contrib wasn't being tagged in github, a lot
> of users had trouble reconciling with which versions of storm it was
> compatible. Some users built off specific commit hashes, some forked, and a
> few even pushed custom builds to repositories such as clojars. With kafka
> 0.8 now available, there are two main storm-kafka projects, the original
> (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> 0.8).
> 
> My intention is not to find fault in any way, but rather to point out the
> resulting pain, and work toward a better solution.
> 
> I think it would be beneficial to the Storm user community to have certain
> commonly used modules like storm-kafka brought into the Apache Storm
> project. Another benefit worth considering is the licensing/legal oversight
> that the ASF provides, which is important to many users.
> 
> If this is something we want to do, then the big question becomes what sort
> governance process needs to be established to ensure that such things are
> properly maintained.
> 
> Some random thoughts, questions, etc. that jump to mind include:
> 
> What to call these things: "contib modules", "connectors", "integration
> modules", etc.?
> Build integration: I imagine they would be a multi-module submodule of the
> main maven build. Probably turned off by default and enabled by a maven
> profile.
> Governance: Have one or more committer volunteers responsible for
> maintenance, merging patches, etc.? Proposal process for pulling new
> modules?
> 
> 
> I look forward to hearing others' opinions.
> 
> - Taylor
> 
> 
> [1] https://issues.apache.org/jira/browse/STORM-206
> [2] https://github.com/nathanmarz/storm-contrib
> [3] https://github.com/stormprocessor
> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> 
> 
> 
> --
> Milinda Pathirage
> 
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
> 
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org<http://milinda.pathirage.org/>
> 
> 
> 
> --
> Twitter: @nathanmarz
> http://nathanmarz.com<http://nathanmarz.com/>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

I purposely left out storm-starter from the discussion to keep things focused, and because it’s a different animal. But I also feel it should be pulled in, albeit differently. I was thinking something along the lines of an “examples” directory, and that all committers would share collective ownership/responsibility.

I haven’t thought too much yet about the others (storm-yarn, etc.), but I think that warrants a discussion as well.

Personally, I’d be willing to sponsor modules for Cassandra, HDFS, HBase, and JMS.

I also contacted the author of storm-kafka-0.8-plus, and he is willing to contribute that work and help with maintenance.

Regarding the juju charms issue [1], my intent wasn’t to shoot it down entirely (which is why I left it open), but rather make it clear that it’s not a priority at this point in time. I’ll admit that it was a bit of a knee-jerk reaction to the fact that someone from Canonical essentially spammed a bunch of Apache projects with the same request. It also seemed not unlike a request for us to maintain .rpm and .deb packages,  etc., which is a path I’d be very hesitant to go down.

- Taylor

[1] https://issues.apache.org/jira/browse/STORM-240

On Feb 26, 2014, at 4:25 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I totally agree and I am +1 on bringing these spout/trident pieces in, assuming there are committers to support them.
> 
> I am also curious about how people feel about pulling in other projects like storm-starter, storm-deploy, storm-mesos, and storm-yarn?
> 
> Storm-starter in my opinion seems more like documentation and it would be nice to pull in so that it stays up to date with storm itself, just like the documentation.
> 
> The others are more like ways to run storm in different environments. They seem like there could be a lot of coupling between them and storm as storm evolves, and they kind of fit with "integrate storm with *Technology X*" except X in this case is a compute environment instead of a data source or store. But then again we also just shot down a request to create juju charms for storm.
> 
> —Bobby

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Brian O'Neill <bo...@alumni.brown.edu>.

Bobby,

FWIW, I’d love to see storm-yarn inside.  I think we could definitely make
things easier on the end-user if they were more cohesive.

e.g. Imagine if we had “storm launch yarn” inside of $storm/bin that would
kickoff a storm-yarn launch, with whatever version was built.  It would
likely simplify the “create-tarball” and storm-yarn getStormConfig process
as well.

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive  King of Prussia, PA  19406
M: 215.588.6024  @boneill42 <http://www.twitter.com/boneill42>  
healthmarketscience.com

On 2/26/14, 4:25 PM, "Bobby Evans" <ev...@yahoo-inc.com> wrote:

>I totally agree and I am +1 on bringing these spout/trident pieces in,
>assuming there are committers to support them.
>
>I am also curious about how people feel about pulling in other projects
>like storm-starter, storm-deploy, storm-mesos, and storm-yarn?
>
>Storm-starter in my opinion seems more like documentation and it would be
>nice to pull in so that it stays up to date with storm itself, just like
>the documentation.
>
>The others are more like ways to run storm in different environments. They
>seem like there could be a lot of coupling between them and storm as
>storm evolves, and they kind of fit with "integrate storm with
>*Technology X*" except X in this case is a compute environment instead of
>a data source or store. But then again we also just shot down a request
>to create juju charms for storm.
>
>Bobby

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Bobby Evans <ev...@yahoo-inc.com>.

I totally agree and I am +1 on bringing these spout/trident pieces in, assuming there are committers to support them.

I am also curious about how people feel about pulling in other projects like storm-starter, storm-deploy, storm-mesos, and storm-yarn?

Storm-starter, in my opinion, seems more like documentation, and it would be nice to pull it in so that it stays up to date with storm itself, just like the documentation.

The others are more ways to run storm in different environments.  It seems like there could be a lot of coupling between them and storm as storm evolves, and they kind of fit with “integrate storm with *Technology X*”, except X in this case is a compute environment instead of a data source or store. But then again, we also just shot down a request to create juju charms for storm.

—Bobby

From: "P. Taylor Goetz" <pt...@gmail.com>>
Reply-To: <de...@storm.incubator.apache.org>>
Date: Wednesday, February 26, 2014 at 1:21 PM
To: <de...@storm.incubator.apache.org>>
Cc: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Thanks for the feedback Bobby.

To clarify, I’m mainly talking about spout/bolt/trident state implementations that integrate storm with *Technology X*, where *Technology X* is not a fundamental part of storm.

Examples would be technologies that are part of or related to the Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka, HDFS, HBase, Cassandra, etc.

The idea behind having one or more Storm committers act as a “sponsor” is to make sure new additions are done carefully and with good reason. Adding a new module would require committer/PPMC consensus and the assignment of one or more sponsors. Part of a sponsor’s job would be to ensure that a module is maintained, which would require enough familiarity with the code to support it long term. If a new module was proposed, but no committers were willing to act as a sponsor, it would not be added.

It would be the Committers’/PPMC’s responsibility to make sure things didn’t get out of hand, and to do something about it if they did.

Here’s an old Hadoop JIRA thread [1] discussing the addition of Hive as a contrib module, similar to what happened with HBase as Bobby pointed out. Some interesting points are brought up. The difference here is that both HBase and Hive were pretty big codebases relative to Hadoop. With spout/bolt/state implementations I doubt we’d see anything along that scale.

- Taylor

[1] https://issues.apache.org/jira/browse/HADOOP-3601


On Feb 26, 2014, at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com>> wrote:

I can see a lot of value in having a distribution of storm that comes with batteries included, everything is tested together and you know it works.  But I don’t see much long term developer benefit in building them all together.  If there is strong coupling between storm and these external projects so that they break when storm changes then we need to understand the coupling and decide if we want to reduce that coupling by stabilizing APIs, improving version numbering and release process, etc.; or if the functionality is something that should be offered as a base service in storm.

I can see politically the value of giving these other projects a home in Apache, and making them sub-projects is the simplest route to that.  I’d love to have storm on yarn inside Apache.  I just don’t want to go overboard with it.  There was a time when HBase was a “contrib” module under Hadoop along with a lot of other things, and the Apache board came and told Hadoop to break it up.

Bringing storm-kafka into storm does not sound like it will solve much from a developer’s perspective, because there is at least as much coupling with kafka as there is with storm.  I can see how it is a huge amount of overhead and pain to set up a new project just for a few hundred lines of code, as such I am in favor of pulling in closely related projects, especially those that are spouts and state implementations. I just want to be sure that we do it carefully, with a good reason, and with enough people who are familiar with the code to support it long term.

If it starts to look like we are pulling in too many projects perhaps we should look at something more like the bigtop project  https://bigtop.apache.org/ which produces a tested distribution of Hadoop with many different sub-projects included in it.

I am also a bit concerned about these sub-projects becoming second class citizens, where we break something, but because the build is off by default we don’t know it.  I would prefer that they are built and tested by default.  If the build and test time starts to take too long, to me that means we need to start wondering if we have too many contrib modules.

—Bobby

From: Brian Enochson <br...@gmail.com>>
Reply-To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
Date: Tuesday, February 25, 2014 at 9:50 PM
To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
Cc: "dev@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <de...@storm.incubator.apache.org>>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

hi,
  I am in agreement with Taylor and believe I understand his intent. An incredible tool/framework/application like Storm is only enhanced and gains value from the number of well maintained and vetted modules that can be used for integration and adding further functionality.
 I am relatively new to the Storm community but have spent quite some time reviewing contributing modules out there, reviewing various duplicates and running into some version incompatibilities. I understand the need to keep Storm itself pure, but do think there needs to be some structure and governance added to the contributing modules. Look at the benefit a tool like npm brings to the node community.
 I like the idea of sponsorship, vetting and a community vote.  I, as I'm sure many would be, am willing to offer support and time to work through how to set this up and to help with the implementation if it is decided to pursue some solution.
 I hope these views are taken in the spirit in which they are made: to make this incredible system even better, along with the surrounding ecosystem.

Thanks,
Brian


On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <pt...@gmail.com>> wrote:
Just to be clear (and play a little Devil’s advocate :) ), I’m not suggesting that whatever a “contrib” project/module/subproject might  become, be a clearinghouse for anything Storm-related.

I see it as something that is well-vetted by the Storm community, subject to PPMC review, vote, etc. Entry would require community review, PPMC review, and in some cases ASF IP clearance/legal review. Anything added would require some level of commitment from the PPMC/committers to provide some level of support.

In other words, nothing “willy-nilly”.

One option could be that any module added require (X > 0)  number of committers to volunteer as “sponsor”s for the module, and commit to maintaining it.

That being said, I don’t see storm-kafka being any different from anything else that provides integration points for Storm.

-Taylor


On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com>> wrote:

I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the multi-module route.


On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu>> wrote:
Hi Taylor,

I'm +1 for pulling these external libraries into the Apache codebase. This
will certainly benefit the Storm community. I would also like to contribute
to this process.

Thanks
Milinda

On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com>> wrote:
A while back I opened STORM-206 [1] to capture ideas for pulling in
"contrib" modules to the Apache codebase.

In the past, we had the storm-contrib github project [2] which subsequently
got broken up into individual projects hosted on the stormprocessor github
group [3] and elsewhere.

The problem with this approach is that in certain cases it led to code rot
(modules not being updated in step with Storm's API), fragmentation
(multiple similar modules with the same name), and confusion.

A good example of this is the storm-kafka module [4], since it is a widely
used component. Because storm-contrib wasn't being tagged in github, a lot
of users had trouble reconciling with which versions of storm it was
compatible. Some users built off specific commit hashes, some forked, and a
few even pushed custom builds to repositories such as clojars. With kafka
0.8 now available, there are two main storm-kafka projects, the original
(compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
0.8).

My intention is not to find fault in any way, but rather to point out the
resulting pain, and work toward a better solution.

I think it would be beneficial to the Storm user community to have certain
commonly used modules like storm-kafka brought into the Apache Storm
project. Another benefit worth considering is the licensing/legal oversight
that the ASF provides, which is important to many users.

If this is something we want to do, then the big question becomes what sort
of governance process needs to be established to ensure that such things are
properly maintained.

Some random thoughts, questions, etc. that jump to mind include:

What to call these things: "contrib modules", "connectors", "integration
modules", etc.?
Build integration: I imagine they would be a multi-module submodule of the
main maven build. Probably turned off by default and enabled by a maven
profile.
Governance: Have one or more committer volunteers responsible for
maintenance, merging patches, etc.? Proposal process for pulling new
modules?


I look forward to hearing others' opinions.

- Taylor


[1] https://issues.apache.org/jira/browse/STORM-206
[2] https://github.com/nathanmarz/storm-contrib
[3] https://github.com/stormprocessor
[4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
[5] https://github.com/wurstmeister/storm-kafka-0.8-plus



--
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org<http://milinda.pathirage.org/>



--
Twitter: @nathanmarz
http://nathanmarz.com<http://nathanmarz.com/>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Thanks for the feedback Bobby.

To clarify, I’m mainly talking about spout/bolt/trident state implementations that integrate storm with *Technology X*, where *Technology X* is not a fundamental part of storm. 

Examples would be technologies that are part of or related to the Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g.: Kafka, HDFS, HBase, Cassandra, etc.

The idea behind having one or more Storm committers act as a “sponsor” is to make sure new additions are done carefully and with good reason. Adding a new module would require committer/PPMC consensus and the assignment of one or more sponsors. Part of a sponsor’s job would be to ensure that a module is maintained, which would require enough familiarity with the code to support it long term. If a new module was proposed, but no committers were willing to act as a sponsor, it would not be added.

It would be the Committers’/PPMC’s responsibility to make sure things didn’t get out of hand, and to do something about it if they did.

Here’s an old Hadoop JIRA thread [1] discussing the addition of Hive as a contrib module, similar to what happened with HBase as Bobby pointed out. Some interesting points are brought up. The difference here is that both HBase and Hive were pretty big codebases relative to Hadoop. With spout/bolt/state implementations I doubt we’d see anything along that scale.

- Taylor

[1] https://issues.apache.org/jira/browse/HADOOP-3601


On Feb 26, 2014, at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I can see a lot of value in having a distribution of storm that comes with batteries included, everything is tested together and you know it works.  But I don’t see much long term developer benefit in building them all together.  If there is strong coupling between storm and these external projects so that they break when storm changes then we need to understand the coupling and decide if we want to reduce that coupling by stabilizing APIs, improving version numbering and release process, etc.; or if the functionality is something that should be offered as a base service in storm.
> 
> I can see politically the value of giving these other projects a home in Apache, and making them sub-projects is the simplest route to that.  I’d love to have storm on yarn inside Apache.  I just don’t want to go overboard with it.  There was a time when HBase was a “contrib” module under Hadoop along with a lot of other things, and the Apache board came and told Hadoop to break it up.
> 
> Bringing storm-kafka into storm does not sound like it will solve much from a developer’s perspective, because there is at least as much coupling with kafka as there is with storm.  I can see how it is a huge amount of overhead and pain to set up a new project just for a few hundred lines of code, as such I am in favor of pulling in closely related projects, especially those that are spouts and state implementations. I just want to be sure that we do it carefully, with a good reason, and with enough people who are familiar with the code to support it long term.
> 
> If it starts to look like we are pulling in too many projects perhaps we should look at something more like the bigtop project  https://bigtop.apache.org/ which produces a tested distribution of Hadoop with many different sub-projects included in it.
> 
> I am also a bit concerned about these sub-projects becoming second class citizens, where we break something, but because the build is off by default we don’t know it.  I would prefer that they are built and tested by default.  If the build and test time starts to take too long, to me that means we need to start wondering if we have too many contrib modules.
> 
> —Bobby
> 
> From: Brian Enochson <br...@gmail.com>>
> Reply-To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
> Date: Tuesday, February 25, 2014 at 9:50 PM
> To: "user@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <us...@storm.incubator.apache.org>>
> Cc: "dev@storm.incubator.apache.org<ma...@storm.incubator.apache.org>" <de...@storm.incubator.apache.org>>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> 
> hi,
>   I am in agreement with Taylor and believe I understand his intent. An incredible tool/framework/application like Storm is only enhanced and gains value from the number of well maintained and vetted modules that can be used for integration and adding further functionality.
>  I am relatively new to the Storm community but have spent quite some time reviewing contributing modules out there, reviewing various duplicates and running into some version incompatibilities. I understand the need to keep Storm itself pure, but do think there needs to be some structure and governance added to the contributing modules. Look at the benefit a tool like npm brings to the node community.
>  I like the idea of sponsorship, vetting and a community vote.  I, as I'm sure many would be, am willing to offer support and time to work through how to set this up and to help with the implementation if it is decided to pursue some solution.
>  I hope these views are taken in the spirit in which they are made: to make this incredible system even better, along with the surrounding ecosystem.
> 
> Thanks,
> Brian
> 
> 
> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <pt...@gmail.com>> wrote:
> Just to be clear (and play a little Devil’s advocate :) ), I’m not suggesting that whatever a “contrib” project/module/subproject might  become, be a clearinghouse for anything Storm-related.
> 
> I see it as something that is well-vetted by the Storm community, subject to PPMC review, vote, etc. Entry would require community review, PPMC review, and in some cases ASF IP clearance/legal review. Anything added would require some level of commitment from the PPMC/committers to provide some level of support.
> 
> In other words, nothing “willy-nilly”.
> 
> One option could be that any module added require (X > 0)  number of committers to volunteer as “sponsor”s for the module, and commit to maintaining it.
> 
> That being said, I don’t see storm-kafka being any different from anything else that provides integration points for Storm.
> 
> -Taylor
> 
> 
> On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com>> wrote:
> 
> I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the multi-module route.
> 
> 
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu>> wrote:
> Hi Taylor,
> 
> I'm +1 for pulling these external libraries into the Apache codebase. This
> will certainly benefit the Storm community. I would also like to contribute
> to this process.
> 
> Thanks
> Milinda
> 
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com>> wrote:
>> A while back I opened STORM-206 [1] to capture ideas for pulling in
>> "contrib" modules to the Apache codebase.
>> 
>> In the past, we had the storm-contrib github project [2] which subsequently
>> got broken up into individual projects hosted on the stormprocessor github
>> group [3] and elsewhere.
>> 
>> The problem with this approach is that in certain cases it led to code rot
>> (modules not being updated in step with Storm's API), fragmentation
>> (multiple similar modules with the same name), and confusion.
>> 
>> A good example of this is the storm-kafka module [4], since it is a widely
>> used component. Because storm-contrib wasn't being tagged in github, a lot
>> of users had trouble reconciling with which versions of storm it was
>> compatible. Some users built off specific commit hashes, some forked, and a
>> few even pushed custom builds to repositories such as clojars. With kafka
>> 0.8 now available, there are two main storm-kafka projects, the original
>> (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
>> 0.8).
>> 
>> My intention is not to find fault in any way, but rather to point out the
>> resulting pain, and work toward a better solution.
>> 
>> I think it would be beneficial to the Storm user community to have certain
>> commonly used modules like storm-kafka brought into the Apache Storm
>> project. Another benefit worth considering is the licensing/legal oversight
>> that the ASF provides, which is important to many users.
>> 
>> If this is something we want to do, then the big question becomes what sort
>> of governance process needs to be established to ensure that such things are
>> properly maintained.
>> 
>> Some random thoughts, questions, etc. that jump to mind include:
>> 
>> What to call these things: "contrib modules", "connectors", "integration
>> modules", etc.?
>> Build integration: I imagine they would be a multi-module submodule of the
>> main maven build. Probably turned off by default and enabled by a maven
>> profile.
>> Governance: Have one or more committer volunteers responsible for
>> maintenance, merging patches, etc.? Proposal process for pulling new
>> modules?
>> 
>> 
>> I look forward to hearing others' opinions.
>> 
>> - Taylor
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/STORM-206
>> [2] https://github.com/nathanmarz/storm-contrib
>> [3] https://github.com/stormprocessor
>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> 
> 
> 
> --
> Milinda Pathirage
> 
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
> 
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org<http://milinda.pathirage.org/>
> 
> 
> 
> --
> Twitter: @nathanmarz
> http://nathanmarz.com<http://nathanmarz.com/>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Brian O'Neill <bo...@alumni.brown.edu>.

I'll pile on. (+1 to Robert's sentiments)

Taylor, just give the word and I can start to transition the IP for
storm-cassandra and storm-cassandra-cql.
I can also lend a hand supporting them.

-brian

---
Brian O'Neill
Chief Technology Officer

Health Market Science
The Science of Better Results
2700 Horizon Drive  King of Prussia, PA  19406
M: 215.588.6024  @boneill42 <http://www.twitter.com/boneill42>  
healthmarketscience.com


On 2/26/14, 1:43 PM, "Robert Lee" <le...@gmail.com> wrote:

>To build on Bobby's statement, it does pain me as a user to have to search
>outside of the project modules to find a compatible build that works with
>the latest version of storm as well as the latest module version. However,
>in instances such as hbase, cassandra, kafka, etc., I think these commonly
>used contrib projects should be pulled into storm if they meet stringent
>criteria of:
>
>1) Several volunteer developers familiar with code to update as new
>versions arise
>2) Fully implemented bolt/spout
>
>" If the build and test time starts to take too long, to me that means we
>need to start wondering if we have too many contrib modules." -- +1
>
>I would be willing to volunteer with the cassandra backing map module
>(especially with the latest CQL3 release).
>
>
>On Wed, Feb 26, 2014 at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:
>
>> I can see a lot of value in having a distribution of storm that comes
>>with
>> batteries included, everything is tested together and you know it works.
>>  But I don't see much long term developer benefit in building them all
>> together.  If there is strong coupling between storm and these external
>> projects so that they break when storm changes then we need to
>>understand
>> the coupling and decide if we want to reduce that coupling by
>>stabilizing
>> APIs, improving version numbering and release process, etc.; or if the
>> functionality is something that should be offered as a base service in
>> storm.
>>
>> I can see politically the value of giving these other projects a home in
>> Apache, and making them sub-projects is the simplest route to that.  I'd
>> love to have storm on yarn inside Apache.  I just don't want to go
>> overboard with it.  There was a time when HBase was a "contrib" module
>> under Hadoop along with a lot of other things, and the Apache board came
>> and told Hadoop to break it up.
>>
>> Bringing storm-kafka into storm does not sound like it will solve much
>> from a developer's perspective, because there is at least as much
>>coupling
>> with kafka as there is with storm.  I can see how it is a huge amount of
>> overhead and pain to set up a new project just for a few hundred lines
>>of
>> code, as such I am in favor of pulling in closely related projects,
>> especially those that are spouts and state implementations. I just want
>>to
>> be sure that we do it carefully, with a good reason, and with enough
>>people
>> who are familiar with the code to support it long term.
>>
>> If it starts to look like we are pulling in too many projects perhaps we
>> should look at something more like the bigtop project
>> https://bigtop.apache.org/ which produces a tested distribution of
>>Hadoop
>> with many different sub-projects included in it.
>>
>> I am also a bit concerned about these sub-projects becoming second class
>> citizens, where we break something, but because the build is off by
>>default
>> we don't know it.  I would prefer that they are built and tested by
>> default.  If the build and test time starts to take too long, to me that
>> means we need to start wondering if we have too many contrib modules.
>>
>> --Bobby
>>
>> From: Brian Enochson <brian.enochson@gmail.com<mailto:
>> brian.enochson@gmail.com>>
>> Reply-To: "user@storm.incubator.apache.org<mailto:
>> user@storm.incubator.apache.org>"
>><user@storm.incubator.apache.org<mailto:
>> user@storm.incubator.apache.org>>
>> Date: Tuesday, February 25, 2014 at 9:50 PM
>> To: "user@storm.incubator.apache.org<mailto:
>> user@storm.incubator.apache.org>"
>><user@storm.incubator.apache.org<mailto:
>> user@storm.incubator.apache.org>>
>> Cc: 
>>"dev@storm.incubator.apache.org<ma...@storm.incubator.apache.org>"
>> <de...@storm.incubator.apache.org>>
>> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>>
>> hi,
>>    I am in agreement with Taylor and believe I understand his intent. An
>> incredible tool/framework/application like Storm is only enhanced and
>>gains
>> value from the number of well maintained and vetted modules that can be
>> used for integration and adding further functionality.
>>   I am relatively new to the Storm community but have spent quite some
>> time reviewing contributing modules out there, reviewing various
>>duplicates
>> and running into some version incompatibilities. I understand the need
>>to
>> keep Storm itself pure, but do think there needs to be some structure
>>and
>> governance added to the contributing modules. Look at the benefit a tool
>> like npm brings to the node community.
>>   I like the idea of sponsorship, vetting and a community vote.  I, as
>> I'm sure many would be, am willing to offer support and time to work
>> through how to set this up and to help with the implementation if it is
>> decided to pursue some solution.
>>   I hope these views are taken in the spirit in which they are made: to
>> make this incredible system even better, along with the surrounding
>> ecosystem.
>>
>> Thanks,
>> Brian
>>
>>
>> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com
>> <ma...@gmail.com>> wrote:
>> Just to be clear (and play a little Devil's advocate :) ), I'm not
>> suggesting that whatever a "contrib" project/module/subproject might
>>  become, be a clearinghouse for anything Storm-related.
>>
>> I see it as something that is well-vetted by the Storm community,
>>subject
>> to PPMC review, vote, etc. Entry would require community review, PPMC
>> review, and in some cases ASF IP clearance/legal review. Anything added
>> would require some level of commitment from the PPMC/committers to
>>provide
>> some level of support.
>>
>> In other words, nothing "willy-nilly".
>>
>> One option could be that any module added require (X > 0)  number of
>> committers to volunteer as "sponsor"s for the module, and commit to
>> maintaining it.
>>
>> That being said, I don't see storm-kafka being any different from
>>anything
>> else that provides integration points for Storm.
>>
>> -Taylor
>>
>>
>> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com<mailto:
>> nathan@nathanmarz.com>> wrote:
>>
>> I'm only +1 for pulling in storm-kafka and updating it. Other projects
>>put
>> these contrib modules in a "contrib" folder and keep them managed as
>> completely separate codebases. As it's not actually a "module" necessary
>> for Storm, there's an argument there for doing it that way rather than
>>via
>> the multi-module route.
>>
>>
>> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage
>><mpathira@umail.iu.edu
>> <ma...@umail.iu.edu>> wrote:
>> Hi Taylor,
>>
>> I'm +1 for pulling these external libraries into the Apache codebase. This
>> will certainly benefit the Storm community. I would also like to contribute
>> to this process.
>>
>> Thanks
>> Milinda
>>
>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com
>> <ma...@gmail.com>> wrote:
>> > A while back I opened STORM-206 [1] to capture ideas for pulling in
>> > "contrib" modules to the Apache codebase.
>> >
>> > In the past, we had the storm-contrib github project [2] which
>> subsequently
>> > got broken up into individual projects hosted on the stormprocessor
>> github
>> > group [3] and elsewhere.
>> >
>> > The problem with this approach is that in certain cases it led to code
>> rot
>> > (modules not being updated in step with Storm's API), fragmentation
>> > (multiple similar modules with the same name), and confusion.
>> >
>> > A good example of this is the storm-kafka module [4], since it is a
>> widely
>> > used component. Because storm-contrib wasn't being tagged in github, a
>> lot
>> > of users had trouble reconciling with which versions of storm it was
>> > compatible. Some users built off specific commit hashes, some forked,
>> and a
>> > few even pushed custom builds to repositories such as clojars. With
>>kafka
>> > 0.8 now available, there are two main storm-kafka projects, the
>>original
>> > (compatible with kafka 0.7) and an updated fork [5] (compatible with
>> kafka
>> > 0.8).
>> >
>> > My intention is not to find fault in any way, but rather to point out
>>the
>> > resulting pain, and work toward a better solution.
>> >
>> > I think it would be beneficial to the Storm user community to have
>> certain
>> > commonly used modules like storm-kafka brought into the Apache Storm
>> > project. Another benefit worth considering is the licensing/legal
>> oversight
>> > that the ASF provides, which is important to many users.
>> >
>> > If this is something we want to do, then the big question becomes what
>> > sort of governance process needs to be established to ensure that such
>> > things are properly maintained.
>> >
>> > Some random thoughts, questions, etc. that jump to mind include:
>> >
>> > What to call these things: "contrib modules", "connectors",
>> > "integration modules", etc.?
>> > Build integration: I imagine they would be a multi-module submodule of
>> the
>> > main maven build. Probably turned off by default and enabled by a
>>maven
>> > profile.
>> > Governance: Have one or more committer volunteers responsible for
>> > maintenance, merging patches, etc.? Proposal process for pulling new
>> > modules?
>> >
>> >
>> > I look forward to hearing others' opinions.
>> >
>> > - Taylor
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/STORM-206
>> > [2] https://github.com/nathanmarz/storm-contrib
>> > [3] https://github.com/stormprocessor
>> > [4] 
>>https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> > [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>>
>>
>>
>> --
>> Milinda Pathirage
>>
>> PhD Student | Research Assistant
>> School of Informatics and Computing | Data to Insight Center
>> Indiana University
>>
>> twitter: milindalakmal
>> skype: milinda.pathirage
>> blog: http://milinda.pathirage.org<http://milinda.pathirage.org/>
>>
>>
>>
>> --
>> Twitter: @nathanmarz
>> http://nathanmarz.com<http://nathanmarz.com/>
>>
>>
>>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Robert Lee <le...@gmail.com>.

To build on Bobby's statement, it does pain me as a user to have to search
outside of the project modules to find a compatible build that works with
the latest version of storm as well as the latest module version. However,
in instances such as hbase, cassandra, kafka, etc., I think these commonly
used contrib projects should be pulled into storm if they meet stringent
criteria of:

1) Several volunteer developers familiar with the code who can update it as
new versions arise
2) A fully implemented bolt/spout

"If the build and test time starts to take too long, to me that means we
need to start wondering if we have too many contrib modules." -- +1

I would be willing to volunteer with the cassandra backing map module
(especially with the latest CQL3 release).


On Wed, Feb 26, 2014 at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I can see a lot of value in having a distribution of storm that comes with
> batteries included, everything is tested together and you know it works.
>  But I don't see much long term developer benefit in building them all
> together.  If there is strong coupling between storm and these external
> projects so that they break when storm changes then we need to understand
> the coupling and decide if we want to reduce that coupling by stabilizing
> APIs, improving version numbering and release process, etc.; or if the
> functionality is something that should be offered as a base service in
> storm.
>
> I can see politically the value of giving these other projects a home in
> Apache, and making them sub-projects is the simplest route to that.  I'd
> love to have storm on yarn inside Apache.  I just don't want to go
> overboard with it.  There was a time when HBase was a "contrib" module
> under Hadoop along with a lot of other things, and the Apache board came
> and told Hadoop to break it up.
>
> Bringing storm-kafka into storm does not sound like it will solve much
> from a developer's perspective, because there is at least as much coupling
> with kafka as there is with storm.  I can see how it is a huge amount of
> overhead and pain to set up a new project just for a few hundred lines of
> code; as such, I am in favor of pulling in closely related projects,
> especially those that are spouts and state implementations. I just want to
> be sure that we do it carefully, with a good reason, and with enough people
> who are familiar with the code to support it long term.
>
> If it starts to look like we are pulling in too many projects perhaps we
> should look at something more like the bigtop project
> https://bigtop.apache.org/ which produces a tested distribution of Hadoop
> with many different sub-projects included in it.
>
> I am also a bit concerned about these sub-projects becoming second class
> citizens, where we break something, but because the build is off by default
> we don't know it.  I would prefer that they are built and tested by
> default.  If the build and test time starts to take too long, to me that
> means we need to start wondering if we have too many contrib modules.
>
> --Bobby
>
> From: Brian Enochson <brian.enochson@gmail.com>
> Reply-To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Date: Tuesday, February 25, 2014 at 9:50 PM
> To: "user@storm.incubator.apache.org" <user@storm.incubator.apache.org>
> Cc: "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
>
> hi,
>    I am in agreement with Taylor and believe I understand his intent. An
> incredible tool/framework/application like Storm is only enhanced and gains
> value from the number of well maintained and vetted modules that can be
> used for integration and adding further functionality.
>   I am relatively new to the Storm community but have spent quite some
> time reviewing contributing modules out there, reviewing various duplicates
> and running into some version incompatibilities. I understand the need to
> keep Storm itself pure, but do think there needs to be some structure and
> governance added to the contributing modules. Look at the benefit a tool
> like npm brings to the node community.
>   I like the idea of sponsorship, vetting and a community vote.  I, as I am
> sure many would be, am willing to offer support and time to working through
> how to set this up and helping with the implementation if it is decided to
> pursue some solution.
>   I hope these views are taken in the spirit they are made, to make this
> incredible system even better along with the surrounding eco-system.
>
> Thanks,
> Brian
>
>
> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> Just to be clear (and play a little Devil's advocate :) ), I'm not
> suggesting that whatever a "contrib" project/module/subproject might
>  become, be a clearinghouse for anything Storm-related.
>
> I see it as something that is well-vetted by the Storm community, subject
> to PPMC review, vote, etc. Entry would require community review, PPMC
> review, and in some cases ASF IP clearance/legal review. Anything added
> would require some level of commitment from the PPMC/committers to provide
> some level of support.
>
> In other words, nothing "willy-nilly".
>
> One option could be that any module added require (X > 0)  number of
> committers to volunteer as "sponsor"s for the module, and commit to
> maintaining it.
>
> That being said, I don't see storm-kafka being any different from anything
> else that provides integration points for Storm.
>
> -Taylor
>
>
> On Feb 25, 2014, at 7:53 PM, Nathan Marz <nathan@nathanmarz.com> wrote:
>
> I'm only +1 for pulling in storm-kafka and updating it. Other projects put
> these contrib modules in a "contrib" folder and keep them managed as
> completely separate codebases. As it's not actually a "module" necessary
> for Storm, there's an argument there for doing it that way rather than via
> the multi-module route.
>
>
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mpathira@umail.iu.edu> wrote:
> Hi Taylor,
>
> I'm +1 for pulling these external libraries into the Apache codebase. This
> will certainly benefit the Storm community. I would also like to contribute to
> this process.
>
> Thanks
> Milinda
>
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <ptgoetz@gmail.com> wrote:
> > A while back I opened STORM-206 [1] to capture ideas for pulling in
> > "contrib" modules to the Apache codebase.
> >
> > In the past, we had the storm-contrib github project [2] which subsequently
> > got broken up into individual projects hosted on the stormprocessor github
> > group [3] and elsewhere.
> >
> > The problem with this approach is that in certain cases it led to code rot
> > (modules not being updated in step with Storm's API), fragmentation
> > (multiple similar modules with the same name), and confusion.
> >
> > A good example of this is the storm-kafka module [4], since it is a widely
> > used component. Because storm-contrib wasn't being tagged in github, a lot
> > of users had trouble reconciling with which versions of storm it was
> > compatible. Some users built off specific commit hashes, some forked, and a
> > few even pushed custom builds to repositories such as clojars. With kafka
> > 0.8 now available, there are two main storm-kafka projects, the original
> > (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> > 0.8).
> >
> > My intention is not to find fault in any way, but rather to point out the
> > resulting pain, and work toward a better solution.
> >
> > I think it would be beneficial to the Storm user community to have certain
> > commonly used modules like storm-kafka brought into the Apache Storm
> > project. Another benefit worth considering is the licensing/legal oversight
> > that the ASF provides, which is important to many users.
> >
> > If this is something we want to do, then the big question becomes what sort
> > of governance process needs to be established to ensure that such things are
> > properly maintained.
> >
> > Some random thoughts, questions, etc. that jump to mind include:
> >
> > What to call these things: "contrib modules", "connectors", "integration
> > modules", etc.?
> > Build integration: I imagine they would be a multi-module submodule of the
> > main maven build. Probably turned off by default and enabled by a maven
> > profile.
> > Governance: Have one or more committer volunteers responsible for
> > maintenance, merging patches, etc.? Proposal process for pulling new
> > modules?
> >
> >
> > I look forward to hearing others' opinions.
> >
> > - Taylor
> >
> >
> > [1] https://issues.apache.org/jira/browse/STORM-206
> > [2] https://github.com/nathanmarz/storm-contrib
> > [3] https://github.com/stormprocessor
> > [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> > [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>
>
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Thanks for the feedback Bobby.

To clarify, I’m mainly talking about spout/bolt/trident state implementations that integrate storm with *Technology X*, where *Technology X* is not a fundamental part of storm. 

Examples would be technologies that are part of or related to the Hadoop/Big Data ecosystem and enable the Lambda Architecture, e.g., Kafka, HDFS, HBase, Cassandra, etc.

The idea behind having one or more Storm committers act as a “sponsor” is to make sure new additions are done carefully and with good reason. Adding a new module would require committer/PPMC consensus and the assignment of one or more sponsors. Part of a sponsor’s job would be to ensure that a module is maintained, which would require enough familiarity with the code to support it long term. If a new module was proposed, but no committers were willing to act as a sponsor, it would not be added.

It would be the Committers’/PPMC’s responsibility to make sure things didn’t get out of hand, and to do something about it if they did.

Here’s an old Hadoop JIRA thread [1] discussing the addition of Hive as a contrib module, similar to what happened with HBase as Bobby pointed out. Some interesting points are brought up. The difference here is that both HBase and Hive were pretty big codebases relative to Hadoop. With spout/bolt/state implementations I doubt we’d see anything on that scale.

- Taylor

[1] https://issues.apache.org/jira/browse/HADOOP-3601


On Feb 26, 2014, at 12:35 PM, Bobby Evans <ev...@yahoo-inc.com> wrote:

> I can see a lot of value in having a distribution of storm that comes with batteries included, everything is tested together and you know it works.  But I don’t see much long term developer benefit in building them all together.  If there is strong coupling between storm and these external projects so that they break when storm changes then we need to understand the coupling and decide if we want to reduce that coupling by stabilizing APIs, improving version numbering and release process, etc.; or if the functionality is something that should be offered as a base service in storm.
> 
> I can see politically the value of giving these other projects a home in Apache, and making them sub-projects is the simplest route to that.  I’d love to have storm on yarn inside Apache.  I just don’t want to go overboard with it.  There was a time when HBase was a “contrib” module under Hadoop along with a lot of other things, and the Apache board came and told Hadoop to break it up.
> 
> Bringing storm-kafka into storm does not sound like it will solve much from a developer’s perspective, because there is at least as much coupling with kafka as there is with storm.  I can see how it is a huge amount of overhead and pain to set up a new project just for a few hundred lines of code; as such, I am in favor of pulling in closely related projects, especially those that are spouts and state implementations. I just want to be sure that we do it carefully, with a good reason, and with enough people who are familiar with the code to support it long term.
> 
> If it starts to look like we are pulling in too many projects perhaps we should look at something more like the bigtop project  https://bigtop.apache.org/ which produces a tested distribution of Hadoop with many different sub-projects included in it.
> 
> I am also a bit concerned about these sub-projects becoming second class citizens, where we break something, but because the build is off by default we don’t know it.  I would prefer that they are built and tested by default.  If the build and test time starts to take too long, to me that means we need to start wondering if we have too many contrib modules.
> 
> —Bobby
> 
> From: Brian Enochson <br...@gmail.com>
> Reply-To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
> Date: Tuesday, February 25, 2014 at 9:50 PM
> To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
> Cc: "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
> Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache
> 
> hi,
>   I am in agreement with Taylor and believe I understand his intent. An incredible tool/framework/application like Storm is only enhanced and gains value from the number of well maintained and vetted modules that can be used for integration and adding further functionality.
>  I am relatively new to the Storm community but have spent quite some time reviewing contributing modules out there, reviewing various duplicates and running into some version incompatibilities. I understand the need to keep Storm itself pure, but do think there needs to be some structure and governance added to the contributing modules. Look at the benefit a tool like npm brings to the node community.
>  I like the idea of sponsorship, vetting and a community vote.  I, as I am sure many would be, am willing to offer support and time to working through how to set this up and helping with the implementation if it is decided to pursue some solution.
>  I hope these views are taken in the spirit they are made, to make this incredible system even better along with the surrounding eco-system.
> 
> Thanks,
> Brian
> 
> 
> On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> Just to be clear (and play a little Devil’s advocate :) ), I’m not suggesting that whatever a “contrib” project/module/subproject might  become, be a clearinghouse for anything Storm-related.
> 
> I see it as something that is well-vetted by the Storm community, subject to PPMC review, vote, etc. Entry would require community review, PPMC review, and in some cases ASF IP clearance/legal review. Anything added would require some level of commitment from the PPMC/committers to provide some level of support.
> 
> In other words, nothing “willy-nilly”.
> 
> One option could be that any module added require (X > 0)  number of committers to volunteer as “sponsor”s for the module, and commit to maintaining it.
> 
> That being said, I don’t see storm-kafka being any different from anything else that provides integration points for Storm.
> 
> -Taylor
> 
> 
> On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com> wrote:
> 
> I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the multi-module route.
> 
> 
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu> wrote:
> Hi Taylor,
> 
> I'm +1 for pulling these external libraries into the Apache codebase. This
> will certainly benefit the Storm community. I would also like to contribute to
> this process.
> 
> Thanks
> Milinda
> 
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>> A while back I opened STORM-206 [1] to capture ideas for pulling in
>> "contrib" modules to the Apache codebase.
>> 
>> In the past, we had the storm-contrib github project [2] which subsequently
>> got broken up into individual projects hosted on the stormprocessor github
>> group [3] and elsewhere.
>> 
>> The problem with this approach is that in certain cases it led to code rot
>> (modules not being updated in step with Storm's API), fragmentation
>> (multiple similar modules with the same name), and confusion.
>> 
>> A good example of this is the storm-kafka module [4], since it is a widely
>> used component. Because storm-contrib wasn't being tagged in github, a lot
>> of users had trouble reconciling with which versions of storm it was
>> compatible. Some users built off specific commit hashes, some forked, and a
>> few even pushed custom builds to repositories such as clojars. With kafka
>> 0.8 now available, there are two main storm-kafka projects, the original
>> (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
>> 0.8).
>> 
>> My intention is not to find fault in any way, but rather to point out the
>> resulting pain, and work toward a better solution.
>> 
>> I think it would be beneficial to the Storm user community to have certain
>> commonly used modules like storm-kafka brought into the Apache Storm
>> project. Another benefit worth considering is the licensing/legal oversight
>> that the ASF provides, which is important to many users.
>> 
>> If this is something we want to do, then the big question becomes what sort
>> of governance process needs to be established to ensure that such things are
>> properly maintained.
>> 
>> Some random thoughts, questions, etc. that jump to mind include:
>> 
>> What to call these things: "contrib modules", "connectors", "integration
>> modules", etc.?
>> Build integration: I imagine they would be a multi-module submodule of the
>> main maven build. Probably turned off by default and enabled by a maven
>> profile.
>> Governance: Have one or more committer volunteers responsible for
>> maintenance, merging patches, etc.? Proposal process for pulling new
>> modules?
>> 
>> 
>> I look forward to hearing others' opinions.
>> 
>> - Taylor
>> 
>> 
>> [1] https://issues.apache.org/jira/browse/STORM-206
>> [2] https://github.com/nathanmarz/storm-contrib
>> [3] https://github.com/stormprocessor
>> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> 
> 
> 
> --
> Milinda Pathirage
> 
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
> 
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
> 
> 
> 
> --
> Twitter: @nathanmarz
> http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Bobby Evans <ev...@yahoo-inc.com>.

I can see a lot of value in having a distribution of storm that comes with batteries included, everything is tested together and you know it works.  But I don’t see much long term developer benefit in building them all together.  If there is strong coupling between storm and these external projects so that they break when storm changes then we need to understand the coupling and decide if we want to reduce that coupling by stabilizing APIs, improving version numbering and release process, etc.; or if the functionality is something that should be offered as a base service in storm.

I can see politically the value of giving these other projects a home in Apache, and making them sub-projects is the simplest route to that.  I’d love to have storm on yarn inside Apache.  I just don’t want to go overboard with it.  There was a time when HBase was a “contrib” module under Hadoop along with a lot of other things, and the Apache board came and told Hadoop to break it up.

Bringing storm-kafka into storm does not sound like it will solve much from a developer’s perspective, because there is at least as much coupling with kafka as there is with storm.  I can see how it is a huge amount of overhead and pain to set up a new project just for a few hundred lines of code; as such, I am in favor of pulling in closely related projects, especially those that are spouts and state implementations. I just want to be sure that we do it carefully, with a good reason, and with enough people who are familiar with the code to support it long term.

If it starts to look like we are pulling in too many projects perhaps we should look at something more like the bigtop project  https://bigtop.apache.org/ which produces a tested distribution of Hadoop with many different sub-projects included in it.

I am also a bit concerned about these sub-projects becoming second class citizens, where we break something, but because the build is off by default we don’t know it.  I would prefer that they are built and tested by default.  If the build and test time starts to take too long, to me that means we need to start wondering if we have too many contrib modules.

—Bobby

From: Brian Enochson <br...@gmail.com>
Reply-To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
Date: Tuesday, February 25, 2014 at 9:50 PM
To: "user@storm.incubator.apache.org" <us...@storm.incubator.apache.org>
Cc: "dev@storm.incubator.apache.org" <de...@storm.incubator.apache.org>
Subject: Re: [DISCUSS] Pulling "Contrib" Modules into Apache

hi,
   I am in agreement with Taylor and believe I understand his intent. An incredible tool/framework/application like Storm is only enhanced and gains value from the number of well maintained and vetted modules that can be used for integration and adding further functionality.
  I am relatively new to the Storm community but have spent quite some time reviewing contributing modules out there, reviewing various duplicates and running into some version incompatibilities. I understand the need to keep Storm itself pure, but do think there needs to be some structure and governance added to the contributing modules. Look at the benefit a tool like npm brings to the node community.
  I like the idea of sponsorship, vetting and a community vote.  I, as I am sure many would be, am willing to offer support and time to working through how to set this up and helping with the implementation if it is decided to pursue some solution.
  I hope these views are taken in the spirit they are made, to make this incredible system even better along with the surrounding eco-system.

Thanks,
Brian

On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
Just to be clear (and play a little Devil’s advocate :) ), I’m not suggesting that whatever a “contrib” project/module/subproject might  become, be a clearinghouse for anything Storm-related.

I see it as something that is well-vetted by the Storm community, subject to PPMC review, vote, etc. Entry would require community review, PPMC review, and in some cases ASF IP clearance/legal review. Anything added would require some level of commitment from the PPMC/committers to provide some level of support.

In other words, nothing “willy-nilly”.

One option could be that any module added require (X > 0)  number of committers to volunteer as “sponsor”s for the module, and commit to maintaining it.

That being said, I don’t see storm-kafka being any different from anything else that provides integration points for Storm.

-Taylor

On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com> wrote:

I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the multi-module route.

On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu> wrote:
Hi Taylor,

I'm +1 for pulling these external libraries into the Apache codebase. This
will certainly benefit the Storm community. I would also like to contribute to
this process.

Thanks
Milinda

On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> A while back I opened STORM-206 [1] to capture ideas for pulling in
> "contrib" modules to the Apache codebase.
>
> In the past, we had the storm-contrib github project [2] which subsequently
> got broken up into individual projects hosted on the stormprocessor github
> group [3] and elsewhere.
>
> The problem with this approach is that in certain cases it led to code rot
> (modules not being updated in step with Storm's API), fragmentation
> (multiple similar modules with the same name), and confusion.
>
> A good example of this is the storm-kafka module [4], since it is a widely
> used component. Because storm-contrib wasn't being tagged in github, a lot
> of users had trouble reconciling with which versions of storm it was
> compatible. Some users built off specific commit hashes, some forked, and a
> few even pushed custom builds to repositories such as clojars. With kafka
> 0.8 now available, there are two main storm-kafka projects, the original
> (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> 0.8).
>
> My intention is not to find fault in any way, but rather to point out the
> resulting pain, and work toward a better solution.
>
> I think it would be beneficial to the Storm user community to have certain
> commonly used modules like storm-kafka brought into the Apache Storm
> project. Another benefit worth considering is the licensing/legal oversight
> that the ASF provides, which is important to many users.
>
> If this is something we want to do, then the big question becomes what sort
> of governance process needs to be established to ensure that such things are
> properly maintained.
>
> Some random thoughts, questions, etc. that jump to mind include:
>
> What to call these things: "contrib modules", "connectors", "integration
> modules", etc.?
> Build integration: I imagine they would be a multi-module submodule of the
> main maven build. Probably turned off by default and enabled by a maven
> profile.
> Governance: Have one or more committer volunteers responsible for
> maintenance, merging patches, etc.? Proposal process for pulling new
> modules?
>
>
> I look forward to hearing others' opinions.
>
> - Taylor
>
>
> [1] https://issues.apache.org/jira/browse/STORM-206
> [2] https://github.com/nathanmarz/storm-contrib
> [3] https://github.com/stormprocessor
> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus

--
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org

--
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Brian Enochson <br...@gmail.com>.

hi,

   I am in agreement with Taylor and believe I understand his intent. An
incredible tool/framework/application like Storm is only enhanced and gains
value from the number of well maintained and vetted modules that can be
used for integration and adding further functionality.

  I am relatively new to the Storm community but have spent quite some time
reviewing contributing modules out there, reviewing various duplicates and
running into some version incompatibilities. I understand the need to keep
Storm itself pure, but do think there needs to be some structure and
governance added to the contributing modules. Look at the benefit a tool
like npm brings to the node community.

  I like the idea of sponsorship, vetting and a community vote.  I, as I am sure
many would be, am willing to offer support and time to working through how
to set this up and helping with the implementation if it is decided to
pursue some solution.

  I hope these views are taken in the spirit they are made, to make this
incredible system even better along with the surrounding eco-system.



Thanks,

Brian


On Tue, Feb 25, 2014 at 9:36 PM, P. Taylor Goetz <pt...@gmail.com> wrote:

> Just to be clear (and play a little Devil's advocate :) ), I'm not
> suggesting that whatever a "contrib" project/module/subproject might
>  become, be a clearinghouse for anything Storm-related.
>
> I see it as something that is well-vetted by the Storm community, subject
> to PPMC review, vote, etc. Entry would require community review, PPMC
> review, and in some cases ASF IP clearance/legal review. Anything added
> would require some level of commitment from the PPMC/committers to provide
> some level of support.
>
> In other words, nothing "willy-nilly".
>
> One option could be that any module added require (X > 0)  number of
> committers to volunteer as "sponsor"s for the module, and commit to
> maintaining it.
>
> That being said, I don't see storm-kafka being any different from anything
> else that provides integration points for Storm.
>
> -Taylor
>
>
> On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com> wrote:
>
> I'm only +1 for pulling in storm-kafka and updating it. Other projects put
> these contrib modules in a "contrib" folder and keep them managed as
> completely separate codebases. As it's not actually a "module" necessary
> for Storm, there's an argument there for doing it that way rather than via
> the multi-module route.
>
>
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu> wrote:
>
>> Hi Taylor,
>>
>> I'm +1 for pulling these external libraries into the Apache codebase. This
>> will certainly benefit the Storm community. I would also like to contribute to
>> this process.
>>
>> Thanks
>> Milinda
>>
>> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
>> > A while back I opened STORM-206 [1] to capture ideas for pulling in
>> > "contrib" modules to the Apache codebase.
>> >
>> > In the past, we had the storm-contrib github project [2] which subsequently
>> > got broken up into individual projects hosted on the stormprocessor github
>> > group [3] and elsewhere.
>> >
>> > The problem with this approach is that in certain cases it led to code rot
>> > (modules not being updated in step with Storm's API), fragmentation
>> > (multiple similar modules with the same name), and confusion.
>> >
>> > A good example of this is the storm-kafka module [4], since it is a widely
>> > used component. Because storm-contrib wasn't being tagged in github, a lot
>> > of users had trouble reconciling with which versions of storm it was
>> > compatible. Some users built off specific commit hashes, some forked, and a
>> > few even pushed custom builds to repositories such as clojars. With kafka
>> > 0.8 now available, there are two main storm-kafka projects, the original
>> > (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
>> > 0.8).
>> >
>> > My intention is not to find fault in any way, but rather to point out the
>> > resulting pain, and work toward a better solution.
>> >
>> > I think it would be beneficial to the Storm user community to have certain
>> > commonly used modules like storm-kafka brought into the Apache Storm
>> > project. Another benefit worth considering is the licensing/legal oversight
>> > that the ASF provides, which is important to many users.
>> >
>> > If this is something we want to do, then the big question becomes what sort
>> > of governance process needs to be established to ensure that such things are
>> > properly maintained.
>> >
>> > Some random thoughts, questions, etc. that jump to mind include:
>> >
>> > What to call these things: "contrib modules", "connectors", "integration
>> > modules", etc.?
>> > Build integration: I imagine they would be a multi-module submodule of the
>> > main maven build. Probably turned off by default and enabled by a maven
>> > profile.
>> > Governance: Have one or more committer volunteers responsible for
>> > maintenance, merging patches, etc.? Proposal process for pulling new
>> > modules?
>> >
>> >
>> > I look forward to hearing others' opinions.
>> >
>> > - Taylor
>> >
>> >
>> > [1] https://issues.apache.org/jira/browse/STORM-206
>> > [2] https://github.com/nathanmarz/storm-contrib
>> > [3] https://github.com/stormprocessor
>> > [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
>> > [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>>
>>
>>
>> --
>> Milinda Pathirage
>>
>> PhD Student | Research Assistant
>> School of Informatics and Computing | Data to Insight Center
>> Indiana University
>>
>> twitter: milindalakmal
>> skype: milinda.pathirage
>> blog: http://milinda.pathirage.org
>>
>
>
>
> --
> Twitter: @nathanmarz
> http://nathanmarz.com
>
>
>

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by "P. Taylor Goetz" <pt...@gmail.com>.

Just to be clear (and to play a little devil’s advocate :) ), I’m not suggesting that whatever a “contrib” project/module/subproject might become should be a clearinghouse for anything Storm-related.

I see it as something that is well-vetted by the Storm community, subject to PPMC review, vote, etc. Entry would require community review, PPMC review, and in some cases ASF IP clearance/legal review. Anything added would require some level of commitment from the PPMC/committers to provide some level of support.

In other words, nothing “willy-nilly”.

One option could be to require that some number (X > 0) of committers volunteer as “sponsors” for any module added, and commit to maintaining it.

That being said, I don’t see storm-kafka being any different from anything else that provides integration points for Storm.

-Taylor


On Feb 25, 2014, at 7:53 PM, Nathan Marz <na...@nathanmarz.com> wrote:

> I'm only +1 for pulling in storm-kafka and updating it. Other projects put these contrib modules in a "contrib" folder and keep them managed as completely separate codebases. As it's not actually a "module" necessary for Storm, there's an argument there for doing it that way rather than via the multi-module route.
> 
> 
> On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu> wrote:
> Hi Taylor,
> 
> I'm +1 for pulling these external libraries into the Apache codebase. This
> will certainly benefit the Storm community. I would also like to contribute to
> this process.
> 
> Thanks
> Milinda
> 
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> > A while back I opened STORM-206 [1] to capture ideas for pulling in
> > "contrib" modules to the Apache codebase.
> >
> > In the past, we had the storm-contrib github project [2] which subsequently
> > got broken up into individual projects hosted on the stormprocessor github
> > group [3] and elsewhere.
> >
> > The problem with this approach is that in certain cases it led to code rot
> > (modules not being updated in step with Storm's API), fragmentation
> > (multiple similar modules with the same name), and confusion.
> >
> > A good example of this is the storm-kafka module [4], since it is a widely
> > used component. Because storm-contrib wasn't being tagged in github, a lot
> > of users had trouble reconciling with which versions of storm it was
> > compatible. Some users built off specific commit hashes, some forked, and a
> > few even pushed custom builds to repositories such as clojars. With kafka
> > 0.8 now available, there are two main storm-kafka projects, the original
> > (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> > 0.8).
> >
> > My intention is not to find fault in any way, but rather to point out the
> > resulting pain, and work toward a better solution.
> >
> > I think it would be beneficial to the Storm user community to have certain
> > commonly used modules like storm-kafka brought into the Apache Storm
> > project. Another benefit worth considering is the licensing/legal oversight
> > that the ASF provides, which is important to many users.
> >
> > If this is something we want to do, then the big question becomes what sort of
> > governance process needs to be established to ensure that such things are
> > properly maintained.
> >
> > Some random thoughts, questions, etc. that jump to mind include:
> >
> > What to call these things: "contrib modules", "connectors", "integration
> > modules", etc.?
> > Build integration: I imagine they would be a multi-module submodule of the
> > main maven build. Probably turned off by default and enabled by a maven
> > profile.
> > Governance: Have one or more committer volunteers responsible for
> > maintenance, merging patches, etc.? Proposal process for pulling new
> > modules?
> >
> >
> > I look forward to hearing others' opinions.
> >
> > - Taylor
> >
> >
> > [1] https://issues.apache.org/jira/browse/STORM-206
> > [2] https://github.com/nathanmarz/storm-contrib
> > [3] https://github.com/stormprocessor
> > [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> > [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
> 
> 
> 
> --
> Milinda Pathirage
> 
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
> 
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
> 
> 
> 
> -- 
> Twitter: @nathanmarz
> http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Nathan Marz <na...@nathanmarz.com>.

I'm only +1 for pulling in storm-kafka and updating it. Other projects put
these contrib modules in a "contrib" folder and keep them managed as
completely separate codebases. As it's not actually a "module" necessary
for Storm, there's an argument there for doing it that way rather than via
the multi-module route.


On Tue, Feb 25, 2014 at 4:39 PM, Milinda Pathirage <mp...@umail.iu.edu> wrote:

> Hi Taylor,
>
> I'm +1 for pulling these external libraries into the Apache codebase. This
> will certainly benefit the Storm community. I would also like to contribute to
> this process.
>
> Thanks
> Milinda
>
> On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> > A while back I opened STORM-206 [1] to capture ideas for pulling in
> > "contrib" modules to the Apache codebase.
> >
> > In the past, we had the storm-contrib github project [2] which subsequently
> > got broken up into individual projects hosted on the stormprocessor github
> > group [3] and elsewhere.
> >
> > The problem with this approach is that in certain cases it led to code rot
> > (modules not being updated in step with Storm's API), fragmentation
> > (multiple similar modules with the same name), and confusion.
> >
> > A good example of this is the storm-kafka module [4], since it is a widely
> > used component. Because storm-contrib wasn't being tagged in github, a lot
> > of users had trouble reconciling with which versions of storm it was
> > compatible. Some users built off specific commit hashes, some forked, and a
> > few even pushed custom builds to repositories such as clojars. With kafka
> > 0.8 now available, there are two main storm-kafka projects, the original
> > (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> > 0.8).
> >
> > My intention is not to find fault in any way, but rather to point out the
> > resulting pain, and work toward a better solution.
> >
> > I think it would be beneficial to the Storm user community to have certain
> > commonly used modules like storm-kafka brought into the Apache Storm
> > project. Another benefit worth considering is the licensing/legal oversight
> > that the ASF provides, which is important to many users.
> >
> > If this is something we want to do, then the big question becomes what sort
> > of governance process needs to be established to ensure that such things are
> > properly maintained.
> >
> > Some random thoughts, questions, etc. that jump to mind include:
> >
> > What to call these things: "contrib modules", "connectors", "integration
> > modules", etc.?
> > Build integration: I imagine they would be a multi-module submodule of the
> > main maven build. Probably turned off by default and enabled by a maven
> > profile.
> > Governance: Have one or more committer volunteers responsible for
> > maintenance, merging patches, etc.? Proposal process for pulling new
> > modules?
> >
> >
> > I look forward to hearing others' opinions.
> >
> > - Taylor
> >
> >
> > [1] https://issues.apache.org/jira/browse/STORM-206
> > [2] https://github.com/nathanmarz/storm-contrib
> > [3] https://github.com/stormprocessor
> > [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> > [5] https://github.com/wurstmeister/storm-kafka-0.8-plus
>
>
>
> --
> Milinda Pathirage
>
> PhD Student | Research Assistant
> School of Informatics and Computing | Data to Insight Center
> Indiana University
>
> twitter: milindalakmal
> skype: milinda.pathirage
> blog: http://milinda.pathirage.org
>



-- 
Twitter: @nathanmarz
http://nathanmarz.com

Re: [DISCUSS] Pulling "Contrib" Modules into Apache

Posted by Milinda Pathirage <mp...@umail.iu.edu>.

Hi Taylor,

I'm +1 for pulling these external libraries into the Apache codebase. This
will certainly benefit the Storm community. I would also like to contribute to
this process.

Thanks
Milinda

On Tue, Feb 25, 2014 at 5:28 PM, P. Taylor Goetz <pt...@gmail.com> wrote:
> A while back I opened STORM-206 [1] to capture ideas for pulling in
> "contrib" modules to the Apache codebase.
>
> In the past, we had the storm-contrib github project [2] which subsequently
> got broken up into individual projects hosted on the stormprocessor github
> group [3] and elsewhere.
>
> The problem with this approach is that in certain cases it led to code rot
> (modules not being updated in step with Storm's API), fragmentation
> (multiple similar modules with the same name), and confusion.
>
> A good example of this is the storm-kafka module [4], since it is a widely
> used component. Because storm-contrib wasn't being tagged in github, a lot
> of users had trouble reconciling with which versions of storm it was
> compatible. Some users built off specific commit hashes, some forked, and a
> few even pushed custom builds to repositories such as clojars. With kafka
> 0.8 now available, there are two main storm-kafka projects, the original
> (compatible with kafka 0.7) and an updated fork [5] (compatible with kafka
> 0.8).
>
> My intention is not to find fault in any way, but rather to point out the
> resulting pain, and work toward a better solution.
>
> I think it would be beneficial to the Storm user community to have certain
> commonly used modules like storm-kafka brought into the Apache Storm
> project. Another benefit worth considering is the licensing/legal oversight
> that the ASF provides, which is important to many users.
>
> If this is something we want to do, then the big question becomes what sort of
> governance process needs to be established to ensure that such things are
> properly maintained.
>
> Some random thoughts, questions, etc. that jump to mind include:
>
> What to call these things: "contrib modules", "connectors", "integration
> modules", etc.?
> Build integration: I imagine they would be a multi-module submodule of the
> main maven build. Probably turned off by default and enabled by a maven
> profile.
> Governance: Have one or more committer volunteers responsible for
> maintenance, merging patches, etc.? Proposal process for pulling new
> modules?
>
>
> I look forward to hearing others' opinions.
>
> - Taylor
>
>
> [1] https://issues.apache.org/jira/browse/STORM-206
> [2] https://github.com/nathanmarz/storm-contrib
> [3] https://github.com/stormprocessor
> [4] https://github.com/nathanmarz/storm-contrib/tree/master/storm-kafka
> [5] https://github.com/wurstmeister/storm-kafka-0.8-plus



-- 
Milinda Pathirage

PhD Student | Research Assistant
School of Informatics and Computing | Data to Insight Center
Indiana University

twitter: milindalakmal
skype: milinda.pathirage
blog: http://milinda.pathirage.org
