Posted to user@storm.apache.org by Sandon Jacobs <sj...@appia.com> on 2015/04/28 14:49:40 UTC

Project Structure, Packaging, and Deployment

My company is using Storm for various stream-processing solutions, mostly ingesting data from Kafka topics. We have chosen to implement our topologies in Scala, with APIs like Tormenta and Summingbird in the mix as well. We have about 9-10 topologies running in production as we speak.

I can find tons of useful information about Storm in general, but VERY little about how folks are managing deployment, Git repos, etc.

Currently we have all of these topologies in the same Git repo, with a main class for each topology, allowing us to run them locally or remotely. Some of this code shares common components: we try to reuse bolts we have written, and other dependencies cross topology boundaries as well.

So in our CI environment, we build an assembly jar with SBT containing all topologies, then use the storm jar command to deploy that jar N times (N = number of topologies). Jenkins runs functional tests after each topology deployment to exercise that topology's functionality. Given the number of topologies in our catalog, this is becoming cumbersome, with the feedback loop from git push through deployment and test getting longer and more unwieldy. The whole thing is starting to remind me too much of my Java EE container days, with multiple EAR or WAR files deployed in a cluster of WebSphere boxes (UGH!!!).
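To make the "deploy the same jar N times" step less hand-maintained, one option is to keep a small catalog in code that maps each topology name to its main class, and have CI iterate over it. This is only a sketch; the topology names and class names below are invented, not ours:

```scala
// Hypothetical sketch: a registry mapping topology names to their main-class
// names inside the single assembly jar, so a CI job can iterate over it and
// invoke `storm jar` once per entry instead of hard-coding N deploy steps.
object TopologyCatalog {
  // topology name -> fully qualified main class in the assembly jar
  val topologies: Map[String, String] = Map(
    "click-ingest" -> "com.example.topologies.ClickIngestTopology",
    "kafka-enrich" -> "com.example.topologies.KafkaEnrichTopology"
  )

  // Render the `storm jar` commands a CI job would run, one per topology.
  def deployCommands(assemblyJar: String): Seq[String] =
    topologies.toSeq.sortBy(_._1).map { case (name, mainClass) =>
      s"storm jar $assemblyJar $mainClass $name"
    }
}

object Demo extends App {
  TopologyCatalog.deployCommands("all-topologies-assembly.jar").foreach(println)
}
```

The point is just to keep the list of deployable topologies in one place, so adding a topology means adding one map entry rather than editing several CI scripts.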

I say all of that to frame the question of how folks are managing similar situations and deployments. There has been some thought around breaking the Git repo up into multiple repos, or perhaps keeping one repo with a parent SBT project, subprojects for common components, and one subproject per topology.
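The parent-project idea could look roughly like the build.sbt below. This is a minimal sketch with invented names (and it assumes the sbt-assembly plugin for building per-topology jars), not a description of our actual build:

```scala
// build.sbt -- hypothetical multi-project layout (sbt 0.13-era syntax).
// `common` holds shared bolts and utilities; each topology is its own
// subproject depending on it and producing its own assembly jar.
lazy val common = (project in file("common"))
  .settings(
    name := "storm-common",
    scalaVersion := "2.11.12"
  )

lazy val clickIngest = (project in file("topologies/click-ingest"))
  .dependsOn(common)
  .settings(
    name := "click-ingest-topology",
    // requires the sbt-assembly plugin
    mainClass in assembly := Some("com.example.topologies.ClickIngestTopology")
  )

lazy val root = (project in file("."))
  .aggregate(common, clickIngest)
```

With this shape, CI can assemble and deploy each topology subproject independently, so a change to one topology no longer forces redeploying all N.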

I am interested to hear any thoughts, or to be pointed to any resources that have been helpful to others.

Re: Project Structure, Packaging, and Deployment

Posted by Mason Yu <co...@gmail.com>.
GM Sandon:

I can empathize with your challenges. In a previous lifetime I worked with
WebSphere 2.0 through 8.x: portals, process server, app server. One
installation had over 500 WAS instances. Pain x 10.

Distributed computing is a double-edged sword. Scaling is terrific when all
the moving parts are in sync, and most everyone loves to tout performance.
The downside is maintenance, debugging, and, as you put it, deployment.

Keeping multiple topologies, bolts, etc. in sync across different
environments is a logistical nightmare. Even with Puppet and Chef handling
the mundane tasks of keeping the Linux binaries and directories in sync,
the "Storm" layer becomes ever more complex. You did not mention ZooKeeper
in your discussion.

Have you thought of pushing this out to AWS or Google Cloud? Assuming you
are doing real-time micro-batching, you might have several test topologies
(works in progress), plus a QA topology and a production topology that
mirror each other, with the real-time traffic split between QA and prod.
If you have a dozen or so topologies, how much Bash scripting do you need?
Too much sysadmin work!

Again, best wishes on your endeavors.

           Mason Yu Jr.
           Principal Enterprise Architect
           Big Data Architects, LLC.

The famous Sun Tzu


Re: Project Structure, Packaging, and Deployment

Posted by Nathan Leung <nc...@gmail.com>.
If you prefer not to use a separate main class for each topology, you can
try something like Flux: https://github.com/ptgoetz/flux. Note that it is
a work in progress. You will still need a YAML document for each topology.
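For a sense of what such a per-topology document looks like, here is a hypothetical Flux-style YAML definition (the spout, bolt, and class names are invented; the structure follows the format described in the Flux README):

```yaml
# Hypothetical Flux topology definition -- names are placeholders.
name: "example-topology"
config:
  topology.workers: 2
spouts:
  - id: "kafka-spout"
    className: "com.example.spouts.ExampleKafkaSpout"
    parallelism: 1
bolts:
  - id: "count-bolt"
    className: "com.example.bolts.CountBolt"
    parallelism: 2
streams:
  - name: "kafka-spout --> count-bolt"
    from: "kafka-spout"
    to: "count-bolt"
    grouping:
      type: SHUFFLE
```

The topology wiring lives entirely in the YAML, so the same jar can be submitted with different documents via storm jar and the Flux main class, without a hand-written main class per topology.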

With regard to code hierarchy, I find it good to keep common libraries in
a separate project or subproject. Related topologies that are part of the
same high-level project can go in the same jar; this also lets you more
easily reuse components like bolts with the declarative Storm topology
definition. Topologies that are less related, or unrelated, can go in
separate jars.