Posted to user@storm.apache.org by Dorian Cransac <dc...@gmail.com> on 2015/03/19 14:16:56 UTC

Booting short-lived topologies frequently and avoiding jar files

Hi,

I'm currently experimenting with Storm to figure out whether it is the
right fit for my project, and I would like to seek other users'
opinions on this, as my tests are getting costlier and costlier (I'm now
setting up a full-scale cluster and deploying topologies on it to measure
deployment time).

My project relies heavily on parallel computation of map/reduce patterns.
However, I'm not interested in an infinite lifespan for each of my
topologies: in practice, I'm working with bounded streams of data, so my
topologies are relatively ephemeral. This is the main (and obvious) reason
I'm doubting my approach here.

However, each topology will handle tons and tons of tuples, and there are
many features in Storm that would be valuable to me, which is why I'm
trying to see whether my square peg would somehow fit into this round hole,
and how far it is from fitting:
- the clean scalability and worker flexibility (rebalancing)
- the reliability of message processing (acks) and exactly-once semantics
- the clean API and interfaces for setting up my topology, and how the
abstractions fit together
- the apparently low overhead resulting from these features (I haven't
tested that thoroughly first-hand, but I'm going by social proof)

Now, a couple of important aspects which make me think there might be hope
here:
- in my primary use case, even though I would be deploying and killing
topologies frequently, the *implementation* of the spouts and bolts
*wouldn't change*. The Java code essentially stays the same, except for
the part where I connect the Legos together (the topology object itself,
I guess), which I don't believe would necessarily be a big deal to deploy
dynamically (is a new jar mandatory, or can we just serialize the topology
and ship it?).
- I'm not interested in *modifying* a topology dynamically. If a given
topology doesn't work for what I'm trying to do, I'm willing to pay the
cost of deploying a new one. At some point I might look into caching the
topologies that match my most frequent use cases (i.e. letting them run
longer, or forever), but that would just be an optimization.

So at this point I'm seeing two problems standing in the way and resulting
from this potential "misuse":

1- the boot time of a topology (not the initial boot time, which includes
starting up ZooKeeper and everything, but just the deployment and start-up
time of each topology, i.e. the *time-to-first-tuple*)
2- the high parallelism of the topologies themselves (not just the tasks,
but the overhead of managing them): could that be a problem, say, for
Nimbus? Does it scale well when topologies are constantly started and
killed?

To test the boot time, I'm currently working on a test where I'll
programmatically deploy topologies on a large cluster and measure the
time penalty (I'm mostly worried about the jar upload). Then I'll automate
that and see whether frequent kill/deploy cycles disturb the system.
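For what it's worth, the measurement loop I have in mind looks roughly like
the sketch below. Everything here is a stand-in: deployTopology() is a
hypothetical stub that just sleeps, whereas the real test would call
StormSubmitter.submitTopology() (or shell out to `storm jar`) and block
until the first tuple is acked.

```java
import java.util.concurrent.TimeUnit;

public class DeployTimer {
    // Hypothetical stand-in for one deployment: the real version would call
    // StormSubmitter.submitTopology() (or run "storm jar") and wait for the
    // first acked tuple. Here it just sleeps to keep the sketch runnable.
    static void deployTopology(String name) {
        try {
            TimeUnit.MILLISECONDS.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        int runs = 5;
        long totalMs = 0;
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            deployTopology("ephemeral-topology-" + i);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            totalMs += elapsedMs;
            System.out.println("deploy " + i + ": " + elapsedMs + " ms");
        }
        System.out.println("mean time-to-first-tuple: "
                + (totalMs / runs) + " ms");
    }
}
```

The interesting number would be the spread between the stubbed baseline and
the real submit-to-first-ack latency once the jar upload is in the loop.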

I'm wondering if the answer here is "just use Hadoop", which I haven't
played with yet (in that case: are Hadoop deployment or boot times any
better?). In any case, any input or feedback (preferably based on
experience) is greatly appreciated.